Foreword

Over the last few years, there has been widespread debate on various
concerns and issues surrounding testing. The participants in the current
debate include not only people with expertise in measurement or
responsibility for giving tests and interpreting their results, but also the
media, unions, ethnic groups, those who take the tests, professional
associations, the courts, and the general public. Given the cross currents
and contradictions, it seemed appropriate to provide a platform for
individuals who have been prominent in the professional associations
relating to educational measurement and research to present their
views of the issues, the evidence with regard to them, and some possible
ways to solve them.

The 1976 ETS Invitational Conference served as such a platform, and
the speakers discussed issues relating to testing as well as some changes
in testing practices. Their respective papers addressed past and present
events in the testing scene; test theory in evaluation and design of tests;
purposes of tests and ways in which test results are presented,
interpreted, and used; aspects of testing and related practices that affect
the student; and different types of decisions for which information
provided by testing may be relevant.

We are indebted to all of the speakers for sharing both their positive
and critical views of the role of measurement in education and society. I
should like to thank William Raspberry, a columnist at The Washington
Post, for his candid luncheon speech in which he presented his views on
the current attacks on standardized tests.



William W. Turnbull

PRESIDENT

The ETS Award for Distinguished Service to Measurement was
established in 1970, to be presented annually to an individual whose work
and career have had a major impact on developments in educational
and psychological measurement. The 1976 Award was presented at the
conference by ETS President William W. Turnbull to Dr. Ralph
Winfred Tyler with this citation:

For fully half a century Ralph Tyler has prodded education and
educational measurement to become both more flexible and more
focused, challenging us to conceptualize and assess those qualities
that are hard to reach and hard to measure but are easily
proclaimed as important goals of education. As Director of
Evaluation of the monumental Eight-Year Study, he helped to shift
education in this country from a narrow conception of subject-matter
learning to a broader conception of growth and development of
individuals, from a restrictive reliance on information, knowledge,
and skills to an encompassing awareness of attitudes,
appreciations, interests, and personal-social adaptability. By continuously
emphasizing the functions of measurement in improving
instruction, he helped to open both curriculum design and educational
evaluation to a wide range of specific objectives and outcomes
formerly lost in vague rhetoric.

As creator and chief architect of the National Assessment of
Educational Progress, he developed the financial, organizational,
and political arrangements needed to make that massive and
controversial concept into a practical and esteemed reality, while at
the same time shaping its technical components to pioneer in the
application of objectives-referenced measurement and
criterion-referenced interpretation at the item level.

As Director of the Center for Advanced Study in the Behavioral
Sciences at Stanford, California, he fostered an atmosphere both
challenging and supportive in which creative scholarship and
interdisciplinary interplay flourished. There, during fifteen years
as administrator, colleague, raconteur and wit, he personally
influenced the development of hundreds of distinguished behavioral
scientists.

For his many contributions to the theory and practice of
education, educational measurement and evaluation, and for his
productive career as teacher and administrator, ETS is pleased to
present the 1976 Award for Distinguished Service to Measurement
to Ralph Winfred Tyler.





Previous Recipients of the
ETS Measurement Award

1970 E. F. Lindquist
1971 Lee J. Cronbach
1972 Robert L. Thorndike
1973 Oscar K. Buros
1974 J. P. Guilford
1975 Harold Gulliksen
Chaos and Controversy



When a scene is characterized by "chaos" and "controversy," it is
reasonable to assume that past events have contributed to the disturbing
situation. [illegible] I deem it appropriate to review some events of the
past two decades that may help [illegible].

Before proceeding, I think it wise to explain how I conducted
[illegible]. As a result of administering a testing program in a large
[illegible] for a number of years, I have files of fat folders containing
[illegible] articles, some convention programs, and various publications
[illegible] important enough to survive several rounds of cleaning. In
addition I have several bookshelves of hardbacks and paperbacks
[illegible] to testing. This miscellaneous collection provided the sources
for this review. Obviously, I make no claim for the [illegible] of the
collection or the review, but I hope you will agree [illegible] have gleaned
some interesting and, I hope, pertinent information.

At the [illegible] believe it is appropriate to establish the date marking
the beginning of the [illegible] testing controversy. I believe I can say it
was not so long ago, on October 4, 1957, the day the Russians sent
Sputnik [illegible]. At first, the American people reacted with [illegible]
disbelief that another nation appeared to be winning the space race. As
soon as they tried to assess why our nation lagged behind, they
immediately began to look critically at the quality and achievement
[illegible]. Within a year, Congress passed the National Defense
Education Act (NDEA) which provided funds for many [illegible] to
establish extensive testing programs. Accordingly, the administration of
standardized tests expanded at a rapid rate.

That same year, 1958, a move by the National Merit Scholarship
Corporation [illegible] a problem to the schools. When the Scholarship
program was [illegible] earlier, testing was limited to the upper five
percent [illegible] seniors. The Scholarship Corporation changed its test
publisher [illegible] Testing Service (ETS) to Science [illegible] seniors in
the fall to [illegible] and suggested that [illegible] students to [illegible]
were not competing for scholarships. Shortly thereafter, ETS began
publishing the Preliminary Scholastic Aptitude Test (PSAT) which was
offered to high [illegible].

[illegible] annual meeting in the spring of 1958, the [illegible] secondary
[illegible] tests for high school students. The next year, Martin Essex,
president of the American Association of School Administrators
(AASA), [illegible] a nine-member committee to study the problems in
testing and sent a [illegible] superintendents [illegible]. Meanwhile,
[illegible] for scoring and processing massive numbers of tests at what
then was almost unbelievable speed. Simultaneously, a second [illegible]
admission, the American College Tests (ACT), had been developed
[illegible] appeared just in time [illegible] what had come to be known as
the "college admissions [illegible]" [illegible] was resulting from
[illegible].

At their annual meeting in February 1960, the National Council on
Measurement in Education (NCME) and the American Educational
Research Association [illegible] symposium of five [illegible] educators
[illegible] addressed themselves to the topic "Resistance to Testing."
[illegible] uses of aptitude and achievement [illegible] discussed, as well
as [illegible] problem of who would be eliminated by [illegible].

In March 1960, [illegible] to 440,000 students in 1,353 secondary
schools [illegible] country. It was the comprehensive [illegible] day battery
of tests [illegible] part of a large-scale, long-range research study known
as Project TALENT. The study was being conducted by the American
Institutes for Research and supported by funds from the U.S. Office of
Education. [illegible] increasing criticism of tests as evidenced by their
publication [illegible] McLaughlin. The foreword [illegible] Lawrence
Derthick [illegible] form of an open letter to parents and teachers who
were assured that Title V of NDEA was [illegible].
Joan Bollenbacher

In quick succession, there appeared several paperbacks and hardbacks
relating [illegible] Strategy of Taking Tests by Darrell Huff. A short time
later came Meeting the Test [illegible] Martin Katz and Benjamin
Shimberg; [illegible] The Tyranny of Testing; [illegible] Chauncey
[illegible] Its Place in Education [illegible] and Gene Hawes [illegible].

While all the testing [illegible] and discussion were going on in the high
schools and colleges, the [illegible] also had some questions [illegible].
Within [illegible] issues of the National Elementary Principal (September
[illegible] 1961) were devoted to educational measurement, one
[illegible] purposes and techniques and the other [illegible]. Unlike the
two recent issues of the Principal devoted to standardized testing, the
1961 issues featured a group of authors who [illegible] a "who's who" in
the testing field.

In the meantime, there were increasing mumblings and grumblings by
high school [illegible] and their parents about the numbers of tests
required [illegible]. Their [illegible] 1962 of Testing, Testing, Testing,
[illegible] paper-bound book prepared by a Joint Committee on Testing
appointed by three national associations: the school administrators, the
chief state school officers, and the secondary school principals.
[illegible] caused shock waves up and down the testing world. A few
quotes [illegible]:

The standardized test is at best an ad hoc device; therefore, its function is
limited. In comparison with the scope and duration of experiences to which
a human being is subjected during his lifetime, the standardized test is a
low-order [illegible].

Like modern wonder drugs, standardized tests have captured the public
mind.

...Most test makers are more or less candid about the limitations of
standardized tests [illegible] it is a mistake to assume that their
knowledge and restraint have been appreciated by the public, or for that
matter, even by many educators.

As I reread this little book, I thought a lot of time and effort could have
been saved if the critics of recent years had reprinted Testing, Testing,
Testing. It condensed in 32 pages most of the criticisms contained in
several lengthy recent [illegible].

I trust the foregoing list of events and publications provides enough
evidence that criticism of tests is not a recent phenomenon. Now let us
[illegible] of national importance, which have had a significant effect on
testing [illegible] added new dimensions to the criticisms beyond the
[illegible].

In 1964, Congress passed the Civil Rights Act. Subsequently a number
of suits on the issue of segregation have been filed in the federal
courts where [illegible] of standardized tests was involved. A federal
agency, the Equal Employment Opportunity Commission, was formed
and [illegible] using tests for selecting [illegible] have been [illegible] the
Federal Courts on the issue of discrimination in the use of [illegible] and
sex [illegible] for the defense. The sex bias [illegible] 1972.

In 1965, the year following the Civil Rights Act, [illegible] the
Elementary and Secondary Education Act (ESEA). Title I of [illegible]
act provided funding for the education of the [illegible]. The evaluation
requirements [illegible] extensive [illegible] of standardized [illegible] to
massive overtesting. Nine years later, in 1974, the [illegible] being tests,
[illegible] direct result of the Title I evaluation problems.

[illegible] as the Coleman Report, was published. Many of its
conclusions, which profoundly affected the public schools, were based on
the results of standardized [illegible].

Meanwhile, as these national events were taking place, back at the
American Psychological Association (APA) a committee composed of
eight members from APA, AERA, and [illegible] of work and published
the Standards for Educational [illegible].

By 1967, planning for the National Assessment of Educational
Progress (NAEP) [illegible] school administrators [illegible]. Magazines
and newspapers had articles on the subject, with the New York Times of
February 12 calling National Assessment "one of the [illegible] hotly
contested issues in American education."

That same year, the College Entrance Examination Board (CEEB)
appointed a 21-member [illegible] to review the College Board
[illegible] to consider possibilities for fundamental changes in tests and
their use, and to make recommendations accordingly. The [illegible]
report was issued three years later.

Back now to May 25, 1969, the date of the Conference on the Ethical
and Legal Aspects of [illegible] Record Keeping convened by the Russell
Sage Foundation. [illegible] report of the Conference resulted in the
publication of a [illegible] dissemination of [illegible] in turn provided
basic information for the [illegible] Rights and Privacy Act, known as the
Buckley Amendment, [illegible] Congress in 1974. No longer could test
scores [illegible].

Now let's take another look at the late 60s, a time of student rebellion
in the colleges and universities which was reflected in the schools.
Teachers [illegible] resisted them [illegible]. Just as schools were trying
to cope [illegible] its ugly head. By 1970, it was hard to find an
educational conference that did not have [illegible] session on
accountability. At most of them, there was a discussion of the uses and
limitations of standardized tests in meeting accountability demands.
Those who predicted trouble ahead [illegible].

On Valentine's Day, 1971, the New York Times reported "in a
historic move, the (New York) board (of education) announced that
it would establish [illegible] the schools and their staffs
accountable for their success in educating children." The New York
Times does not [illegible] terms like "historic move," even on
Valentine's Day [illegible]. The article reported that the move was
supported by Albert Shanker of the American Federation of Teachers
(AFT). Those who follow events [illegible] New York schools will be
interested in the report, "Security in a Citywide Testing Program," by
Anthony J. Polemeni, published [illegible] National Council on
Measurement in Education.

Just a year after the New York Times article, 650 members of the
National Education Association (NEA) who met at the annual NEA
Conference [illegible] Rights called for an immediate moratorium on
standardized [illegible]. There are those who would say that from there
on it has been downhill all the way.

At the time the NEA was calling for a moratorium, APA, AERA, and
NCME were working on the revision of the 1966 Standards. A section
on "Standards for the Use of Tests" was added to the publication. After
several [illegible] had been completed, [illegible] NEA was asked for
[illegible]. The NEA representatives [illegible] being [illegible] declined
to participate. The revised Standards [illegible] published in 1974. I fear,
however, that the document in its present form has not had wide
circulation beyond [illegible] and students in classes in educational
measurement. As I understand it, some translation [illegible].

Now I would [illegible] of the past eighteen months which, in my
judgment, have made it almost impossible for persons who have
responsibilities for testing to cope with the resulting [illegible]. For
openers, there was the March/April [illegible] National Elementary
Principal, the official publication of the National Elementary Principals
Association. The cover [illegible] "IQ, the Myth of Measurability." Most
of the 16 articles [illegible] Houts, editor of the magazine, [illegible] "an
intensive national [illegible] standardized testing." The July/August
[illegible] a devastating attack on tests [illegible] as a blast at the
National Assessment [illegible] Progress [illegible] assessment that "asks
powerless communities to [illegible] themselves [illegible] the powerful."
The lead editorial stated: "...it is now imperative for the education
profession to take the [illegible] the current tests, testing [illegible] must
[illegible] education profession itself." The editor also called for
immediate cessation of the practice [illegible] to the press.

The September/October issue of the magazine contained four letters
to the editor approving [illegible] "IQ Issue," but one letter [illegible]
Herbert [illegible] of Michigan State University registered violent
exception. [illegible] issue, he said, "We had professors of physics,
[illegible]. Nowhere did I find a [illegible] whose special competency,
training, and experience qualified [illegible] to address as complex an
issue as standardized [illegible]."

Between the publication of the two issues of the National Elementary
Principal, there appeared a new critic of the tests, the consumer
advocate. In the May 1975 Ladies Home Journal, of [illegible], there was
an article in which Ralph Nader called for citizens "whose lives are
[illegible] the power of ETS to call to account the testers and the
institutions that support them."
Before most elementary school principals had had time to read their
magazine, an interesting event occurred in Akron, Ohio, in late
September. The Akron Board of Education was [illegible] a motion for a
new trial, and the [illegible] won the suit forcing the Akron Board to
release school-by-school test results to the press; therefore, [illegible]
position of the editor of the National Elementary [illegible] relative to
the release [illegible] not upheld by the court.

Another event [illegible] matters in the schools. The College
[illegible] average scores attained by the 1975 high [illegible] on the
Scholastic Aptitude Test (SAT) [illegible] the lowest ever. It was noted
also that more women than men had [illegible] the test. [illegible] SAT
and also on the [illegible] College [illegible] matter of continuing concern
to the [illegible] earlier [illegible] 10 experts from test organizations and
colleges to try to define the problem, but with little success. The College
Board also appointed a so-called "blue ribbon panel" to study the
problem.

I cannot [illegible] mentioning an article which appeared this past
February in the Cincinnati Enquirer. It quoted [illegible] vice president
of [illegible] American College [illegible] as saying that the decline of
college [illegible] scores may be partly due to an increase in "mediocre
college [illegible] female students." Mediocre, indeed!

Now we come to October 4, 1975. James J. Kilpatrick in his
syndicated column [illegible] the issue of the National Elementary
[illegible]:

For a variety of reasons, public education is in deep trouble in America.
We need urgently to know the dimensions of this trouble; we need to know
which approaches, [illegible] and devices work and which ones fail. The
innocent pupils can't tell us; the defensive educators don't want their
schools compared; parents are ill-equipped for evaluation. That leaves
the standardized tests. [illegible] are, we had better keep them
in use.

Exactly one week later, October 11, 1975, Mr. William Raspberry of
the Washington Post devoted his column to a discussion of the same
issue of the Elementary Principal. He concluded:

Teachers (and school districts) who want to conceal how effectual they
are can avoid comparisons [illegible] other school units serving similar
populations by avoiding standardized testing.

I suspect that one of the [illegible] parents are reluctant to let go of
standardized tests, as bad as they are, is that they don't trust the schools
to give them candid evaluations of how well the schools are performing.




Meanwhile, [illegible] reported that the annual evaluation of NAEP
was [illegible] nine-member team appointed by the Department of
Health, Education and Welfare. The team said that National [illegible] of
limited use to states and [illegible]. Then it was reported [illegible]
director of NAEP commented [illegible] provide [illegible] that will be
used in the decision-making process in schools [illegible]. [illegible] it
was reported that a spokesperson [illegible] Accounting Office [illegible]
criticism saying that it is not NAEP's business to set national
[illegible].

Such an [illegible] of comments can only be [illegible] to the teacher
[illegible] might look upon National Assessment as criterion-referenced,
only to learn that now it is suggested that it be norm-referenced! Add to
[illegible] the article by James Popham in the May 1976 Phi Delta
Kappan suggesting that there can be normative [illegible]
criterion-referenced [illegible].

Now we come to November 1975, when representatives of some 35 or
40 national educational [illegible] education groups met in Washington to
consider [illegible] of widespread use of [illegible] conference was
convened by the National [illegible] Dakota Study Group on Evaluation
under a grant from the Rockefeller Brothers Fund. [illegible] month the
draft of a nine-item position statement was released. Following the
second meeting of the group, [illegible] reported that the symposium
had not yet agreed on a [illegible] statement about tests but that the
participants [illegible] square off [illegible] of seven test publishing
companies who attended." The third meeting of the group was held in the
early fall of 1976, but as [illegible].

As if all this controversy is not enough, even the National Council
of Teachers of English (NCTE) added to the confusion. At their annual
meeting last Thanksgiving, [illegible] defeated a resolution to eliminate
[illegible] from tests because they were afraid [illegible] if they
[illegible] resolution, it would appear they favored standardized tests!

About the time English teachers were not considering test bias, the
National Institute of Education (NIE) convened a three-day conference
on bias in achievement [illegible]. One account reported that Robert Ebel
of Michigan State University said there is no direct evidence that
achievement tests commonly used in the [illegible] biased, and the
chances [illegible] such evidence are "quite [illegible]." [illegible] Green,
also from [illegible], disagreed, saying that the tests are inappropriate
for [illegible] black, Spanish-speaking, [illegible] white families; and
worse, [illegible] on such tests are used [illegible] for a watered-down
[illegible]. Green reportedly said, however, that he [illegible] up, not
abolishing, standardized [illegible] because, he said, "I'd sooner
challenge the bias of tests than the biases of [illegible]."
Another [illegible] relates to the opposing viewpoints of [illegible]. The
NEA position was widely publicized [illegible] executive director spoke
to the Commonwealth Club of San Francisco. The headline [illegible]
Reporter proclaimed, "Standardized Tests Must Go, Herndon
[illegible]." Conversely, the American Federation of Teachers (AFT)
[illegible] annual meeting in August 1976, indicating that [illegible]
standardized tests, they should be improved, but they should not be used
for evaluating teachers or staff [illegible].

While the arguments [illegible] standardized tests go on, a trend in the
country which undoubtedly will involve considerably more testing
should not be ignored; that is the [illegible] reported recently that
already five states have enacted [illegible] competency testing and
[illegible] states have initiated studies or [illegible] competencies.
Criterion-based [illegible] will be a lot of testing.

In a recent Gallup Poll the question was asked: "Should all high
school students in the United States be required to pass a standard
examination in order to get a high school diploma?" A total of 65
percent of the [illegible] "yes."

In reporting on the results of the same Gallup poll, a large headline
in the September 25 issue of the Cincinnati Enquirer stated,
"Americans [illegible]." When asked for reasons to explain the
decline in national test scores, only 16 percent of those polled by
Gallup gave as a reason that the tests are not reliable. Since we do have
an interested [illegible] it seems especially appropriate for us to consider
[illegible].

Now, where does all of this lead us? Obviously, when the Elementary
Principals Association, the National Council of Teachers of English and
the NEA are objecting to standardized tests, there is a problem. In this
whole confusing [illegible] educators we have an obligation to [illegible]
regarding standardized tests. [illegible] statements that "standardized
[illegible]" [illegible] suggesting that the test of General Educational
Development (GED) be abolished, a test which is given annually
[illegible] the United States [illegible] establish school [illegible]. We
have given the test in [illegible] years and [illegible] to a better job and a
better [illegible] for many persons in our city; yet it is a standardized
[illegible].

The college admissions tests (SAT and [illegible]) are other examples of
standardized tests which should be considered. If we cast these out, are
we going to return to the days when candidates for admission to college
had to take a different [illegible] each college [illegible] applied?
[illegible] to the SAT, the prestigious eastern colleges admitted
students primarily from eastern prep schools and a few public schools.
After the [illegible] officers discovered that [illegible] are capable
[illegible] the country, and consequently student populations were drawn
from a wider, more representative area. Also, before we throw out the
SAT and ACT, we should think about the effects of the Family Rights
and Privacy Act which opens all records to parents and students.
[illegible] counselors and teachers are likely to be far less candid in
their letters of recommendation. If we eliminate tests and letters of
recommendation, [illegible] has only grades and rank in [illegible], and
school grades vary considerably [illegible] school to another. Then, for
the admissions [illegible] with [illegible] only grades and class rank for
decisions, it will be natural to favor the school he or she knows best; and
we are right back where we started.

If "standardized [illegible] go" means no testing, I think that is an
unrealistic point of [illegible] in the society in which we live. People
[illegible] continually [illegible] decide [illegible] officers decide
[illegible] admitted; employers decide who will get the job; football
scouts decide who will get the scholarship; [illegible] decide who will
make the team. "Aha!" you may say. "But many [illegible]
criterion-referenced." True. But, believe me, they are norm-referenced
too. It's strange that no one objects to all of those norms in the world
of athletics: the [illegible] yards he has gained, how fast he runs, his
batting average. Those are all compared against the performance of other
individuals. Maybe they are accepted because they are not called
standardized [illegible].

There is [illegible] seems to be unrealistic to say that standardized
[illegible] be eliminated. With the national average of over [illegible]
cost to educate a [illegible], it seems likely that the parents [illegible]
are going to want some evidence other than verbal assurance
[illegible].

And that leads me to still another point. I firmly believe that most
teachers [illegible] at teaching. If the children are not learning, then we
need [illegible] of evidence to establish [illegible] they are not learning
[illegible] hard data. What about attitudes? What about attendance and
mobility? What about [illegible] instruction? And what about
achievement [illegible] me we have an obligation to help teachers with
the [illegible] interpreting such data.

In [illegible] of [illegible] we must consider standardized achievement
tests as one of many kinds of data we use. They do provide good
information about the achievement of individual pupils, especially when
scores are [illegible] data are [illegible] are to be fair. I think we
[illegible] recognize that reporting achievement data, at least for
[illegible].

[illegible] that those [illegible] who work with tests and testing have
not done our best to help [illegible] the public to understand [illegible] of
standardized tests. Our neglect is illustrated [illegible] a statement in a
booklet published by the National School Public Relations Association
entitled Releasing Test Scores: Educational [illegible] How to Tell the
Public. Here is what it says:

Beware of [illegible]. The natural impulse in attacking such a problem
is to assemble the test [illegible] statisticians to explain. But beware
of this seemingly simple approach. Statisticians and test specialists enjoy
talking [illegible] that very well. But they have trouble with educators.
The evidence: Everything goes well until somebody asks a question. It's
all [illegible].

If you have a test specialist or statistician on your staff who can
popularize the [illegible] rare jewel. If not, have [illegible] work very
closely with your information specialists as they prepare their
[illegible].



If this statement does not convince you that we have a problem, then
I refer you to a statement by Henry [illegible] spoke to the test
directors of large [illegible] does. He said:

"I find disturbing ... the behavior of many [illegible] measurement a
fascinating field of inquiry, but who retreat from all the controversies
[illegible] testing and [illegible] into cozy little coteries where they write
beautiful essays to one another that are so heavily laced with
mathematical equations that it is a rare person out there in the schools
who can understand what they are talking about. Much of what they
produce can be of extraordinary importance to your evaluator on the
front line, but it is almost always buried so deep in technical books and
journals that, for all intents and purposes, it is irretrievable."

As an example, Dr. Dyer cited the Journal of Educational
Measurement (JEM) published by the National Council on Measurement
in Education (NCME). The irony is that NCME is intended to serve the
practitioner. Lest some in the audience are concerned that I am
suggesting JEM has no place in NCME, I wish to assure you that is
farthest from my mind. What I am [illegible] information be translated
into publications that can be understood by those who are not
psychometricians and measurement experts. A long time ago, NEA
published a series called "What Research Says to the Teacher," but I
know of no similar [illegible].

By now you must have [illegible] enough of chaos and controversy.
Perhaps you may think that this recital of events in testing [illegible] two
decades is a bit much. I shall now conclude with [illegible].

Publications about tests and testing are almost totally [illegible]
humor. As a Cincinnatian, I think it appropriate to say that 99-44/100
percent of them can be so categorized. [illegible] provides ample proof
that there is no humor in testing, I decided to take a drastic step to
improve the situation and quote Art Buchwald, who was recently
interviewed on the "Today" show. He was asked if the lack of humor in
the presidential campaign [illegible] political satirist. He replied, "Just
because there's no humor doesn't mean it isn't funny."
I am going to talk about several applications of test theory in the public
[illegible]. The thread running through the various applications is the
evaluation and [illegible] tests for particular individuals, or for
particular subgroups, or at particular ability levels.

An outstanding, not yet completely digested [illegible] psychometric
[illegible] three 1971 articles on test bias and culture fairness by Robert
L. Thorndike, by Richard Darlington, and by Robert L. Linn and Charles
Werts. Until these articles appeared, many of us thought that we could
determine whether a test or selection procedure was fair or unfair to
minority groups by using [illegible] simple statistical procedure.
[illegible] important contributions was [illegible] clear that what is fair
according to one definition may be quite unfair [illegible].

Consider the selection and hiring of job applicants, or the selection
of people for [illegible] to college. Suppose first of all that in advance
of selection we [illegible] some adequate criterion measure on all
applicants. In this very unlikely situation, we might simply select
the applicants on the basis of criterion score, regardless of group
membership.

Whether or not this is a [illegible] not a measurement [illegible]. The
measurement problem [illegible] as is ordinarily the case, we do not have
the criterion measure available at the time of [illegible] we have
available a test score whose only virtue is that it predicts the criterion
measure. The correlation between predictor and criterion is usually not
very high, probably no higher than [illegible] effect of selecting people
[illegible].

*Part of this talk and Figs. 1-6 are taken from a forthcoming paper in
Journal of Educational Measurement titled "Practical Applications of
Item Characteristic Curve Theory." Fig. [illegible] F. M. Lord, "Quick
Estimates of Relative Efficiency of Two Tests as a Function of Ability
Level," Journal of Educational Measurement, 1974, 11, 247-254. Used
[illegible].

A common [...] single cutting score on the
predictor without regard to group [...] procedure
maximizes the [...] score of the selected individuals. This
[...] but is it fair to the
individuals involved? And in particular to members of minority [...]

Suppose [...] could select on criterion [...] Suppose that selecting
on the criterion [...] result in selecting [...] percent, say, of all
applicants [...] a certain minority group. Consider now the effect of
substituting a predictor for the criterion score. It could happen that
when we use a single cutting score on the predictor for selection, only
25 percent, say, of the minority [...] Such a result certainly does
not seem [...] group.
These [...]
[...] This institution [...] admit the individuals with the highest
expected [...] But the use of [...] predictor has [...]
the minority group [...] as many of this group will be selected as
would [...] if the criterion were available at the time of selection.
[...]

It seems clear that [...] a bad one. There are two possible
approaches to correcting it. One approach, which has led to
important papers by prominent workers in the field, is to try to correct
the inequities resulting from [...] by setting different
cutting scores for different groups. The main conclusion from reading
the papers on this subject seems to be that different sets of cutting
scores will be [...] and judged fair by different people. There does
not seem to be any way of correcting for a biased predictor [...] a way
that will [...] value systems.

An alternative [...] attempted, is to try to
improve the predictor so that the same cutting score can be used for
everyone. Whether or not a particular predictor is seriously unfair to
some minority group depends on what the predictor measures. If the
predictor measures some trait that is irrelevant for success, a minority
group that happens to rank low on this irrelevant trait will obviously be
unfairly treated by use of a single cutting score on this predictor.
Again, if the predictor does not measure some trait that is important
for success, a minority group that happens to rank high on this
important trait will be unfairly treated by use of a single cutting score
on this deficient predictor. An obvious course is to try to improve our
predictors so as to avoid the unfair situations.

It is interesting to ask what would happen if we could build a
predictor that differed from the criterion only because of random errors
of measurement. Would the use of such a predictor with a single cutting
score still be unfair to minority groups? The answer by Linn and Werts
is that such a procedure will slightly favor low-scoring groups and
handicap high-scoring groups. The reason is that a predictor containing
random errors of measurement will differentiate high-level and
low-level groups less well than would the criterion score, were it
available. This means that more people will be selected from low groups
and fewer people from high groups.

This becomes particularly obvious in the extreme where the predictor
is almost completely unreliable. If the predictor had zero reliability, it
could not discriminate between one group and another group, which
means that any two groups would have the same distribution of
predictor scores. In such a case, clearly, use of a single cutting score
on the predictor favors any group that is low on criterion score.
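
The Linn and Werts argument can be checked numerically. What follows is
a minimal Python sketch under assumptions that are not part of the talk:
two equal-sized groups whose criterion scores are normal with unit
standard deviation and means of -0.5 and +0.5, a predictor equal to the
criterion plus independent normal error, and a single cutting score
chosen so that 30 percent of all applicants are selected. The function
name selection_rates and every numerical value are hypothetical.

```python
import math


def norm_sf(x, mean=0.0, sd=1.0):
    """Upper-tail probability P(X > x) for a normal distribution."""
    return 0.5 * math.erfc((x - mean) / (sd * math.sqrt(2.0)))


def selection_rates(mean_low, mean_high, error_sd, overall_rate):
    """Selection rate in each of two equal-sized groups.

    Assumed model: criterion ~ Normal(group mean, 1) and
    predictor = criterion + Normal(0, error_sd). A single cutting score
    on the predictor is found by bisection so that `overall_rate` of
    all applicants are selected.
    """
    sd = math.sqrt(1.0 + error_sd ** 2)  # predictor sd within a group

    def overall(cut):
        return 0.5 * (norm_sf(cut, mean_low, sd) + norm_sf(cut, mean_high, sd))

    lo, hi = -20.0, 20.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if overall(mid) > overall_rate:
            lo = mid  # too many selected: raise the cutting score
        else:
            hi = mid
    cut = 0.5 * (lo + hi)
    return norm_sf(cut, mean_low, sd), norm_sf(cut, mean_high, sd)


# Selecting on the criterion itself (no error) versus a noisy predictor.
low_crit, high_crit = selection_rates(-0.5, 0.5, error_sd=0.0, overall_rate=0.3)
low_pred, high_pred = selection_rates(-0.5, 0.5, error_sd=1.0, overall_rate=0.3)
print("criterion:", round(low_crit, 3), round(high_crit, 3))
print("predictor:", round(low_pred, 3), round(high_pred, 3))
```

Under these assumed values the noisy predictor admits a larger share of
the low-scoring group and a smaller share of the high-scoring group than
the criterion would, which is the direction the argument predicts.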

It may not be possible in many cases to produce mental tests that
differ from an important criterion only because of errors of
measurement. We certainly can work toward this, however. We can try to
avoid predictors that measure some irrelevant trait, to the disadvantage
of a minority group. If we cannot avoid using such predictors, then indeed
we will have a difficult task deciding how to select cutting scores to
compensate for measuring the wrong traits.

Let me now turn to a different subject. In classical test theory, the
value of a test is usually summarized by one or more of three
coefficients: the validity coefficient, the reliability coefficient, and
the standard error of measurement. Any such coefficient describes the
average performance on the test for a certain group.

The magnitude of the first two coefficients varies from group to
group. In general, such a coefficient, reported by the publisher for a
supposedly nationally representative group, will not be appropriate for
any particular teacher and his or her class of students. A particular
classroom is likely to have a smaller range of talent than a nationally
representative group.

The standard error of measurement of a test may be reasonably
constant from group to group, provided the groups are not very different
in ability level. But now we have a different problem: we can compare
standard errors of measurement from group to group, but not from test
to test. The standard error of measurement is expressed in terms of the
raw score scale, which varies from one test to another. If we use
standardized scores instead of raw scores, then we cannot compare
standard errors of measurement from group to group.
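
The dependence of these coefficients on the group can be illustrated
with the classical relation between the standard error of measurement
(SEM), the group standard deviation, and the reliability:
SEM = SD * sqrt(1 - reliability). The Python sketch below assumes a raw
score SEM of 3 points that is the same in both groups; the standard
deviations (10 for a national group, 6 for one classroom) and the
function names are hypothetical.

```python
def reliability(sem, group_sd):
    """Classical reliability implied by a raw-score SEM and a group SD."""
    return 1.0 - (sem / group_sd) ** 2


def sem_in_standard_units(sem, group_sd):
    """SEM re-expressed on a standardized scale (SD = 1 in this group)."""
    return sem / group_sd


SEM = 3.0           # assumed constant across groups, in raw-score points
NATIONAL_SD = 10.0  # hypothetical nationally representative group
CLASSROOM_SD = 6.0  # hypothetical classroom with a narrower range of talent

national_rel = reliability(SEM, NATIONAL_SD)
classroom_rel = reliability(SEM, CLASSROOM_SD)
national_std_sem = sem_in_standard_units(SEM, NATIONAL_SD)
classroom_std_sem = sem_in_standard_units(SEM, CLASSROOM_SD)
print(national_rel, classroom_rel)
print(national_std_sem, classroom_std_sem)
```

The same raw-score SEM yields a lower reliability in the narrower group,
and once scores are standardized within each group the SEM itself no
longer agrees from group to group.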

What is needed is a method of describing the effectiveness of a test in
a way that will be appropriate both for across-group comparisons and
for across-test comparisons, provided that the tests are all measures of
the same trait, ability, or skill. Does this sound impossible? We can come
close to doing this.

Figure 1 shows the relative efficiency of two widely used tests of
reading vocabulary. The relative efficiency varies according to level of
developed ability, which is shown along the base line of the figure.
Specifically, the figure shows the relative efficiency of a reading
vocabulary score from the Sequential Tests of Educational Progress (STEP),
relative to a reading vocabulary score from the Metropolitan Achievement
Test (MAT). The data describe a particular form of each test.

Figure 1.

Relative efficiency of STEP compared to MAT.

What is meant by relative efficiency here? The efficiency of a single
test at a particular ability level is inversely proportional to the squared
standard error of measurement for people at that ability level.

If two tests measure on the same score scale, then their relative
efficiency at a particular ability level is simply the ratio of their
squared standard errors of measurement at that level. Since two tests from
different publishers typically measure on different score scales, even
though they are tests of the same ability, an adjustment must be made
for differences in score scale. Thus the relative efficiency of one test
with respect to another at a particular ability level is simply the ratio of
their squared standard errors of measurement at that level adjusted for
differences in score scale. If one test has a relative efficiency of .5 with
respect to another at some ability level, then doubling the length of the
first test will make it as efficient as the second test.
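
The arithmetic of this definition can be sketched in a few lines of
Python. The sketch assumes that the score-scale adjustment can be
represented, at the ability level in question, by a single multiplier
that converts one test's score units into the other's, and that
efficiency grows in proportion to test length when items like those
already in the test are added. The function names and the multiplier
argument are hypothetical; this is not the procedure used to draw the
figures.

```python
def relative_efficiency(sem_a, sem_b, scale_factor_a_to_b=1.0):
    """Efficiency of test A relative to test B at one ability level.

    Efficiency is inversely proportional to the squared standard error
    of measurement, so the ratio is (SEM of B / SEM of A) squared, with
    A's SEM first converted to B's score scale.
    """
    sem_a_on_b_scale = sem_a * scale_factor_a_to_b
    return (sem_b / sem_a_on_b_scale) ** 2


def length_multiplier_to_match(rel_eff):
    """Factor by which the first test must be lengthened to match the second."""
    return 1.0 / rel_eff


# Same score scale: test A has twice the SEM of test B.
print(relative_efficiency(2.0, 1.0))
# Different scales: one point on A is assumed to equal half a point on B.
print(relative_efficiency(4.0, 3.0, scale_factor_a_to_b=0.5))
# A relative efficiency of .5 calls for a test twice as long.
print(length_multiplier_to_match(0.5))
```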

Figure 1 shows that the STEP test is more efficient than the MAT test
at low ability levels, but less efficient at all other levels. This reflects
the fact that the STEP test is much easier than the MAT test. It is well
known that an easy test discriminates best among low-level students.
A hard test discriminates best among high-level students.

The STEP test is shorter than the MAT test. The dashed horizontal
line shows the relative efficiency that would be found if the two tests
differed only in length.

Figure 2 shows the relative efficiency of a particular form of another
published reading vocabulary test compared to MAT. This test is less
effective than MAT for most of the range of interest here.

In these figures the base line is calibrated in terms of percentile rank
for a particular group of students. The top horizontal line is calibrated
in terms of raw scores on both the tests administered. With the aid of
such figures, if a teacher knows the ability level of his group or the
ability levels at which he wishes to make effective discrimination, then
he can make an informed choice among available published tests. This is
much better than relying on coefficients reported by the publishers for
groups that contain students at ability levels not relevant for this
teacher.

How do we get these relative efficiency curves? They can be produced
by a rather complicated and expensive process based on the estimation
of item parameters by item response theory. Fortunately a usable
approximation to the relative efficiency curves can be obtained directly
from frequency distributions of number-right scores, as I have pointed
out in a 1974 issue of the Journal of Educational Measurement. The
dashed jagged lines in the figures show the approximations obtained
directly from the number-right score distributions, with the help of a
desk calculator.

Figure 2.

Relative efficiency of Form A, Reading Vocabulary Test compared to MAT.

Such relative efficiency curves have many uses besides choosing
among published tests. Recently at Educational Testing Service (ETS)
and at the College Entrance Examination Board certain revisions of the
Scholastic Aptitude Test (SAT) were contemplated. A possibly desirable
revision was to try to make the tests easier for low-ability students,
provided this could be done without impairing the measurement
effectiveness of the test for high-ability students. It was decided to
investigate the effects of various possible changes from existing forms
of the test.

A particular form of the Verbal SAT was chosen and analyzed. We
then asked such questions as the following: Suppose we took the five
easiest items in this form of the Verbal SAT and added five more items
with statistical properties exactly like these. What would be the relative
efficiency of the resulting test? This relative efficiency, relative to the
form of the test in actual use, is shown by curve 2 in Figure 3. As might
be expected, the effectiveness of the test is slightly improved for
examinees at low ability levels without much change in the effectiveness
of the test elsewhere.

Figure 3.

Relative efficiency of various modified SAT Verbal tests.

Curve 3 shows the effect of eliminating a block of five medium
difficulty items in the middle of the test. Efficiency is impaired for
middle-ability students, but there is not too much effect elsewhere.

If we simultaneously add five easy items, as already described, and
eliminate five items of medium difficulty, the relative efficiency of the
resulting test is shown by curve 4. This is seen to be a sort of
combination of the other two curves. It does seem to be possible to
improve the measurement effectiveness of the test at low ability levels
without sacrificing its effectiveness at high ability levels. However, we
do lose effectiveness at medium ability levels. In general, experience
shows that any gain achieved at one ability level is usually paid for by
a loss of effectiveness at some other level. Usually the only way to avoid
this rule would be to write better items; but this increases the cost of
test production.

There is something to be learned from curves 6, 7, and 8. Curve 6
shows what would happen if we simply discarded the easiest half of
the items in the test. The half-length test would be almost as good as the
full-length test for high-ability students. Such a test would of course
be virtually useless for low-ability students. This tells us that the easiest
half of the items in the current form of the SAT Verbal test are
contributing very little towards measuring the high-ability students. In
effect, only half the time spent by the high-ability students in taking
the test is of any use for measuring them.

Curve 7 leads to a particularly interesting conclusion. Curve 7
represents the relative efficiency of a half-length test obtained by
discarding the hardest half of the items in the Verbal SAT. In contrast
to curve 6, notice that here throwing away half the items improves the
measurement at low-ability levels. The reason is that low-ability
examinees guess at random on hard items. The resulting random noise tends
to drown out whatever measurement would otherwise be accomplished
by the easier items.
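
The effect of random guessing on the number-right score can be
reproduced with a small calculation. The Python sketch below assumes a
three-parameter logistic item response function and uses the standard
expression for the information provided by the number-right score: the
squared slope of the expected score divided by the conditional variance
of the score. The ten items, their parameter values, and the guessing
level of 0.2 are all hypothetical and are not the SAT items analyzed in
the talk.

```python
import math

D = 1.7  # conventional scaling constant for the logistic model


def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct answer."""
    logistic = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return c + (1.0 - c) * logistic


def p_slope(theta, a, b, c):
    """Derivative of p_correct with respect to ability theta."""
    logistic = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return D * a * (1.0 - c) * logistic * (1.0 - logistic)


def number_right_information(theta, items):
    """Information of the number-right score at ability theta.

    items is a list of (a, b, c) tuples: discrimination, difficulty,
    and lower asymptote (chance success by guessing).
    """
    slope_sum = sum(p_slope(theta, a, b, c) for a, b, c in items)
    variance = sum(
        p_correct(theta, a, b, c) * (1.0 - p_correct(theta, a, b, c))
        for a, b, c in items
    )
    return slope_sum ** 2 / variance


easy_half = [(1.0, b, 0.2) for b in (-2.0, -1.5, -1.0, -0.5, 0.0)]
hard_half = [(1.0, b, 0.2) for b in (0.5, 1.0, 1.5, 2.0, 2.5)]
full_test = easy_half + hard_half

low_ability = -2.0
info_full = number_right_information(low_ability, full_test)
info_easy_only = number_right_information(low_ability, easy_half)
print("relative efficiency of easy half:", info_easy_only / info_full)
```

For a low-ability examinee the hard items add almost nothing to the
slope of the expected score but add guessing variance to the score, so
in this sketch the half-length test of easy items is more informative
than the full test.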

The conclusion that I want to emphasize is that we cannot make a test
appropriate for low-ability examinees simply by adding some easy
items. As long as the test contains many hard items on which these
examinees guess at random, the test cannot be a really effective
measuring instrument for them.

Curve 8 shows the relative efficiency of a full-length Verbal SAT
when all the items are at the same medium difficulty level. It is obvious
that replacing medium difficulty items by hard items and by easy items
reduces the measurement effectiveness for most of the examinees,
since most of them are in the middle of the ability range.

All this suggests the following conclusion: If we really want effective
measurement for both high-ability examinees and for low-ability
examinees, and furthermore if the ability range in the group tested
is sufficiently large, then it will be impossible to achieve our objective
with any conventional test. The objective cannot be achieved simply
by adding hard items at one end and easy items at the other end. It
becomes necessary to try some unconventional form of testing, such as
multilevel testing, two-stage testing, or tailored testing.

Before discussing such unconventional tests, consider an alternate
possibility. Let us take our conventional test and score the answer
sheets in the usual way. After doing this, let us divide the examinees
into three or four groups according to their scores. We can now rescore
the answer sheets in each subgroup using a set of item scoring weights
appropriate for that subgroup. For the highest subgroup of examinees,
an appropriate scoring weight for each item will be roughly
proportional to the [...] of the item, or to the item-test
biserial correlation. For the lowest subgroup of examinees, the proper
item scoring weights are quite different: the difficult items should each
receive a scoring weight of approximately zero.

After rescoring each subgroup with item scoring weights appropriate
to the subgroup, the scores from different subgroups will all be put on
the same scale, by conventional equating methods. Once this is done,
each examinee tested will have been scored with a set of item scoring
weights roughly appropriate for him. Thus each person will be
measured more effectively than under conventional scoring procedures.

Although this would result in some improvement, I do not believe it
is a very effective solution to the problem under discussion. If only a
quarter, or a third, of the items in the test are really appropriate for
low-ability students, then no amount of statistical manipulation will make
this into a really good test for such students. The only way to achieve this
is somehow to arrange so that such students take a full set of test items,
all of which are appropriate and effective for them.

I am not necessarily urging that effective measurement of low-ability
students should be a prime objective of the College Entrance
Examination Board. Most of the colleges that use the College Board tests
are concerned with effective measurement in the upper half or two-thirds
of the score range. On the other hand, there are some colleges using
these tests where most students score in the lower part of the range.
Thus it may be desirable for the test to measure effectively there too.
Also, it may be desirable that the test should not be a traumatic
experience for those lower-level examinees who take it.

If we wish to be sure that the difficulty level of a test is matched to
the ability level of the particular individual taking it, we can consider
various unconventional procedures embraced by the term individualized
testing. There are various names for these procedures such as
computer-based testing, branched testing, sequential item testing,
tailored testing, flexilevel testing, multilevel testing, and two-stage
testing.

The United States Civil Service Commission is carrying out an
extensive investigation into tailored testing. It has several computer
terminals in its Washington office where volunteers are invited to take a
tailored test. Vern Urry at the Commission tells us that this
experimental work is very successful. The people taking the tailored test
like it better than the conventional test. Furthermore the Commission is
currently able to achieve with about twenty items what formerly would
require a hundred items. The Commission is making plans to use
tailored testing on a nationwide basis in about five years if no
unexpected obstacles are encountered.

Computer-based tailored testing is a fairly complicated procedure
requiring some initial investment. There is a simple procedure called
multilevel testing which is currently more readily available to all of us.
An experimental study into the effectiveness of a multilevel test was
recently carried out under the direction of Dr. Gary Marco, ETS. The
final report on this study has not yet been issued; today I will simply
describe a multilevel test.

Suppose that we have a set of fifty items all measuring roughly the
same psychological trait or skill or ability. The items are arranged in
five levels: a, b, c, d, e, in order of difficulty. All students start the test
by answering level c. At this point they are told that if the items they
have answered seemed rather difficult, they should next answer level b.
If level c seemed rather easy, they should next answer level d. When
they have completed a second group of items, an appropriate set of
instructions is again given allowing each examinee to choose a third
level of items adjacent in difficulty to the levels already answered.

Each examinee winds up taking a block of exactly 30 consecutive
items (3 consecutive levels). Each answer sheet is scored in the usual
fashion. There are three different possible blocks of items that an
examinee may take: abc, bcd, or cde. Scores on these three blocks must
be equated across blocks. This can be done by conventional methods, or
by using item characteristic curve theory. Once all scores have been put
on the same scale by equating, each examinee should be measured more
effectively than by a conventional test, since each examinee has
presumably taken items better matched in difficulty to his ability level.

It may be helpful to think of a multilevel test as if it were a
three-stage test. The examinee does his own routing. This avoids the
problem of scoring each stage in time to route the student to an
appropriate later stage.
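
The self-routing rule just described can be written out as a short
Python sketch. It assumes the third choice works like the second: an
examinee who found the items so far rather difficult moves to the next
easier level, and otherwise to the next harder level. The function name
route and its two yes-or-no arguments are hypothetical.

```python
LEVELS = "abcde"  # five levels of ten items each, easiest to hardest


def route(found_first_hard, found_second_hard):
    """Return the block of three consecutive levels an examinee takes.

    Everyone starts at level c. After each of the first two levels the
    examinee reports whether the items so far seemed rather difficult.
    """
    taken = [LEVELS.index("c")]
    # Second level: b if level c seemed difficult, otherwise d.
    taken.append(taken[0] - 1 if found_first_hard else taken[0] + 1)
    # Third level: adjacent to the two levels already answered.
    if found_second_hard:
        taken.append(min(taken) - 1)
    else:
        taken.append(max(taken) + 1)
    return "".join(LEVELS[i] for i in sorted(taken))


for first in (True, False):
    for second in (True, False):
        print(first, second, route(first, second))
```

Whatever the two choices, the examinee ends with one of the three
blocks abc, bcd, or cde, so only three sets of scores need to be
equated.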

You can all think of various possible difficulties with such a
multilevel test. Suppose an individual does not route himself
appropriately. In this case, the worst that will happen is that he will
be measured less accurately than otherwise. If the tests are properly
equated, his expected score will not be affected. We hope that most of
the students will route themselves appropriately and thus be measured
more accurately than by a [...]-item conventional test.

Frederic M. Lord

From what I know of the results, the multilevel test tried out last fall
was about as effective as expected. A detailed discussion will appear
in the final report of this study, at which point the practical value of
multilevel testing can be better assessed.

Another recent application of test theory in the public interest is item
sampling. When examinees are sampled also, we speak of matrix
sampling. Although this application is well established, many of the
necessary mathematical formulas are so long and cumbersome that
they have never been worked out. I would expect that the next
important basic development in this area would be a computer program
by means of which the computer itself will carry out the mathematics
and derive the necessary formulas.

There are several other important, relatively new applications of test
theory in the public interest. One of these is the design and evaluation
of mastery tests. My own opinion is that Allan Birnbaum's Chapter 19
in Lord and Novick provides a detailed and clearly worked out theory
for the design and evaluation of mastery tests. Other approaches will
doubtless be effective also.

Test Theory in the Public Interest

Figure 5.
Item response curves for item 2.

Another area, still very much in formation, is the use of tests in
individualized instruction or in computer-assisted instruction. Such use
of tests may come under the heading of mastery tests. I find that it is
considerably different from the tailored testing discussed earlier.

In closing let me return to the question of bias, but now instead of
considering test bias, let me talk about item bias. In the last three
figures, the base line in each figure represents ability or skill. The
curves in each figure represent the probability of success on a particular
item as a function of ability level. The three figures are for three
different items from the Verbal Scholastic Aptitude Test. The solid curve
in each figure is for a group of white students. The dotted curve is for a
group of black students.

In Figure 4 we see that high-ability white students do better on this
item than high-ability black students, but that low-ability black students
do better on the item than low-ability white students. Black students
do better than white students throughout most of the ability range.

Figure 6.
Item response curves for item 59.

Figure 5 shows a partially similar situation except that in this case the
item is totally undiscriminating for black students. High-ability black
students, as determined by other items on the test, do no better on this
item than low-ability black students.

Figure 6 shows a difficult item on which blacks do better than whites
at every ability level where there is a difference. There are, of course,
other items on which whites do as well or better than blacks at each
ability level.

Such items contain a bias, a somewhat complicated kind of bias.
It would seem desirable to exclude such items from our tests as far as
possible. Let me emphasize that the curves shown here were picked
simply because they did show a definite difference between black groups
and white groups. Most of the items in the Verbal SAT do not show
large biases of this kind.

These curves have only recently become available as a result of a
study designed by Dr. Marco. We have not yet had time to study the
test items and compare them with the statistical results. It is to be hoped
that as a result of such studies, we will learn how to design items that
do not show these kinds of bias.

The thread running through the various applications of test theory
that I have discussed is the evaluation and design of tests for particular
individuals, or for particular subgroups, or at particular ability levels.
Such concerns represent worthwhile applications of test theory in the
public interest.

Having had to submit a title for my paper, "The Baby and the Bath
Water Are Still With Us," before I had begun to write it, I must now
try to make it work...

Once upon a time there was a baby, a beautiful, smiling, unspoiled
baby whom everybody admired and who, the people thought, would
bring enlightenment into the world and open doors long barred to most
of them. One day, when the baby was being bathed, someone noticed
that the bath water hadn't been changed for a while and it had gotten
cloudy and somewhat dirty. For some strange reason no one in the
household was quite sure what to do about it. Some advocated
throwing out the bath water. Others said the baby should be thrown out
because it had contaminated the bath water. Still others argued that
since both the baby and the bath water were obviously contaminated,
it would be best to get rid of them both. A group of very conservative
members of the household, not willing to take any risks, opted for
keeping both but conceded that the dirty water could be removed a
teaspoonful at a time and replaced by clean water. And so, some
undetermined number of teaspoonfuls later, here we are: the baby and
much of the bath water are still with us.

So much for the analogy... 

We have had, over the past two decades, some enormously complex
problems relating to testing. And although it is obvious that we have
made some progress on a great many fronts, we cannot really say that
we have taken a giant step or two forward.

Defining the Issues

The issues are quite familiar to most of us. Broadly defined, they
concern the purposes of tests; the test content and what it measures; and
the ways in which test results are presented, interpreted, and used.

Why do we test, particularly in the schools? In the best of all possible
worlds the main purpose of measurement in the schools should be to
facilitate understanding of the individual as a whole, complex,
continuously developing person. Such measurement should provide
information about the individual's cognitive and noncognitive
characteristics, style of learning and of solving problems, and his or her
needs, values, interests, and goals. Such information should also help
teachers and counselors to provide the best possible instruction and
guidance, and interventions designed to enhance personal development.

Unfortunately, however, this is not the best of all possible worlds,
and truths, half-truths, and untruths wage a chaotic war within it.
Today's tests, it is charged, do not measure the more elusive qualities
of an individual, such as creativity or the ability to cope. True, but most
tests, especially those given in the schools, don't purport to do so.
The test title and the technical manual usually make it clear that the
test is a test of reading achievement, for example, or mechanical
understanding, or vocational interests. Until measures of these other
qualities have been developed successfully, we shall have to be content
with using, along with those test scores that are available, all other
information we can gather about an individual, a highly recommended
practice at all times, regardless of how much test data is available.

Another charge (in fact, probably the major charge heard against
testing today) is that the test content and the resulting norms reflect
the dominant culture and are insensitive to differences in experience,
language, and cognitive style and the ways in which they might
interact with test directions and test content. Normative data, it is
further charged, make unfair comparisons that are then used to pin
erroneous labels on members of minority groups, limiting their options
with regard to education, career, and way of life, and perpetuating
destructive stereotypes.

Few would argue that there is not one iota of truth to these charges.
Tests are sometimes misused and their results erroneously interpreted.
Individuals have been erroneously labeled and relegated to a very
narrow set of options. Test content sometimes does reflect instances of
bias, both cultural and sex. Irrelevant tests have been used for employee
selection and evaluation and for other purposes for which the tests in
question were never intended.

Tests and Bias



Almost all of the charges relate in one way or another to the issue of
bias, some to a greater, some to a lesser extent. To deal with the issue
of bias, though, it is first necessary to know what it is we are talking about
and to be sure that we are all talking about the same thing. As of now,
we are far from agreement on a definition, although the literature of
the past few years contains a great abundance of studies of bias and the
attempts to correct it. Cleary has suggested that a test is biased
if scores for subgroups are consistently predicted too high or too low.
Standards for Educational and Psychological Tests alerts test users to
the existence of many different definitions of bias and fairness and
points out that whether a given procedure is or is not fair may depend
on the definition accepted. Somewhat similar problems have arisen
with regard to the definition of sex bias both in career interest
measurement (Diamond; Hanson & Prediger) and in achievement testing
(Diamond).
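
Cleary's definition lends itself to a simple check: fit one regression
line predicting the criterion from the test score for everyone, and see
whether the prediction errors for a subgroup are consistently of one
sign. The Python sketch below is a minimal illustration of that idea
with made-up data. The function name mean_residual_by_group and the two
groups "A" and "B" are hypothetical, and the sketch is not presented as
the procedure used in any of the studies cited.

```python
def mean_residual_by_group(records):
    """Mean prediction error per group under one common regression line.

    records is a list of (group, test_score, criterion_score) tuples.
    A clearly positive mean residual means the common line predicts too
    low for that group; a clearly negative one means it predicts too high.
    """
    n = len(records)
    mean_x = sum(x for _, x, _ in records) / n
    mean_y = sum(y for _, _, y in records) / n
    sxx = sum((x - mean_x) ** 2 for _, x, _ in records)
    sxy = sum((x - mean_x) * (y - mean_y) for _, x, y in records)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    totals = {}
    for group, x, y in records:
        residual = y - (intercept + slope * x)
        running_sum, count = totals.get(group, (0.0, 0))
        totals[group] = (running_sum + residual, count + 1)
    return {group: s / k for group, (s, k) in totals.items()}


# Made-up data: same test scores in both groups, but group B earns
# criterion scores two points higher at every test score.
data = [("A", x, float(x)) for x in range(5)]
data += [("B", x, float(x) + 2.0) for x in range(5)]
residuals = mean_residual_by_group(data)
print(residuals)
```

In this made-up example the common line predicts too high for group A
and too low for group B at every score, which is the pattern the
definition describes.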

Breland and Ironson ask: What is a minority? What is a disadvantaged
applicant? The problem of classification of different minorities,
they have found, is a complex and virtually insurmountable task. The
DeFunis decision, for example, defined a minority as a select group of
nonwhites, excluding Asian Americans except for Philippine Americans,
and excluding Puerto Ricans but not Chicanos.

Ebel has argued that "The bias which accounts for poor test
performance by some minority persons is not in the tests so much as it is
in the culture, and thus is another problem altogether" (p. 87). Even if
we agree (and I don't think that test bias and cultural or societal bias
are mutually exclusive), how do we go on from there? Can we afford
to wait until society corrects its own biases, through a gradual process
of education and change? Judging from the desegregation experience,
that may be a long time, as much as one hundred years. Should we
instead try interventions of various kinds, including intervention in
the testing situation, wherever there is a chance that they might be
effective?

Sources of Bias

If we are to do anything about bias in testing, however we define it, we
should first consider its sources. Flaugher has defined three principal
sources.

1. The test content. Probably the most commonly perceived source
of bias in testing is the content of the test itself: Is it biased in
language? Does it lack balance in its appeal to different groups? Is it
insensitive to differences in experiences or the absence of certain
experiences?

2. The atmosphere of testing. I would enlarge this source to the society
itself and place it above test content in importance. Much of the
research in this area deals with the self-concept the individual brings
to the testing situation and his or her perceived relationship to the
larger society. Flaugher includes the amount of sophistication or
experience needed to overcome idiosyncratic characteristics of the testing
situation. Among these are the type of test item and the answer
sheet format, which constitute the medium and which students must
overcome in order to concentrate on the message of the test content
itself. Other variables in this category are race (or, I might add, sex) of
the examiner and perceived use to which the test results are to be put.

3. Test use. Biased use of test results would occur where one group is
systematically favored over the other in selection, classification, and
the like on the basis of test results, whether the membership group
be black, Chicano, male, female, or any other.

Although Flaugher states that women "are not the usual sort of
minority group and do not have the usual sort of difficulties with
testing" (p. 3), it is not difficult to see the same three sources operating with
regard to sex bias. The content of the test often reflects experiences
that traditional social roles have closed to women or men or have
thoroughly discouraged them from exploring. Subtleties of the
socialization process often carry over into the atmosphere of testing, where
women and, to a lesser extent, men bring to the testing situation the
self-concept that society has preordained for them. And test results
have frequently been used to rule out nontraditional options and to
perpetuate the status quo.
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Looking for Solutions 

What, then, should be done about testing? How should the
controversial issues be resolved? Generally, two opposed courses are
suggested:

1. Declare a moratorium on all tests and testing until the
shortcomings can be eliminated.

2. Retain the tests for the information that they can provide, and at the
same time encourage a program of research directed toward
elimination of or control for bias and place top priority on better
interpretation and use of test results.

As Standards for Educational and Psychological Tests points out, to
declare a moratorium on the use of tests requires a corresponding but
unlikely moratorium on decisions - employment decisions, selection
decisions by colleges and universities, and decisions based on the
evaluation of various educational and social programs. But there always
have been such decisions, with or without testing, and they will
continue to be made. Colleges and universities, the Standards go on to say,
will continue to select students, "some elementary pupils will still be
recommended for special education, and boards of education will
continue to evaluate the success of specific programs" (p. 2). The decisions,
however, will be based on more subjective, less dependable methods
than standardized assessment techniques. Moreover, tests that are
useful for discovering abilities that might otherwise remain unidentified
will no longer be available.

To assume that such decisions can be made fairly without reliable,
objective measures is to assume that everyone charged with making
judgments about others in our society is socially concerned, free from
prejudice, and trained in the skills and pitfalls of assessment, diagnosis,
and evaluation. If tests are guilty of reflecting middle-class values, will
the judgments of middle-class teachers, counselors, administrators, and
employers necessarily be less so? Can any of us honestly say that he or she has
almost never misjudged a person's capabilities or attitudes because of some
idiosyncratic mode of dress or social behavior or some unusual physical
characteristic? Have our own value systems never entered into our
judgments of others?

The argument in favor of a moratorium also implies that decisions
are made about individuals on the basis of test scores alone. Yet test
manuals and professional articles and books on testing carry repeated
warnings that tests are tools that provide objective and important
information about an individual but that they do not provide all possible
information and therefore should not be used alone but with all other
pertinent information available.

If we adopt the second course - retaining the tests for the
information they provide and at the same time embarking on a program to
improve them and the ways in which they are used - what are the steps
we should take? What kinds of relevant research and development are
already under way?

Correcting Test Bias 

Models for the correction of test bias that have appeared in the
literature on testing over the past eight to ten years generally fall into one
of three categories:

1. Correcting test bias at the item construction level. This is probably the
least frequent model. It involves trying to build a bias-fair test from
scratch, beginning with the instructions to item writers, before items
are pretested. One example is the work of Rayman, who attempted
to construct interest inventory items for vocationally related scales
that would be balanced for response rate by sex within each scale. A
similar model for achievement tests was suggested by Diamond.

2. Correcting test bias at the item distribution level. This type of model
is closely related to the first type, except that it begins with the items
already in hand and the item statistics for the various groups
involved in the testing. Medley and Quirk examined differences
between black and white candidates' performance on the common
examinations of the National Teacher Examinations. They
constructed experimental forms and compared performance on items
reflecting black culture, those reflecting modern culture, and items
that were considered traditional. Differences in performance on one
test made up of equal numbers of black and modern-culture items
and another test consisting of traditional items only were significant
for 13 of the 14 pairs of groups tested. Significant differences were
also found in favor of blacks on the black-culture items and in favor
of whites on the modern-culture items.

Echternacht compared the distributions of transformed p-value
differences for independent pairs of groups with a hypothetical
normal distribution, using the obtained mean and variance of the
differences as parameters. He considered the test biased if points on
the actual distribution fell outside the bands around the hypothetical
line whose width is determined by sample size and significance level.

Angoff describes several studies, including his own, in which
bivariate plots of transformed p-values were examined for item x
group interaction. Angoff also mentions the possibility of building
a test on the basis of a common core of items "broadly relevant to
the educational objectives of society generally and the individuals
for whom it is intended" plus items specific to the curriculum of
each of the component groups but not the group as a whole (p. 26).
With such balance, Angoff maintains, no one group would have an
advantage across the total test.

In one study described by Angoff, involving black and white
groups, item x group interaction for inter-race scatter plots decreased
when groups were matched on an external variable. This
suggests the possibility of matching groups on socioeconomic status,
expressed as a composite of parental occupational and educational
levels. Angoff warns, however, that the designations for these levels
might not have exactly the same meanings for blacks as for whites.

3. Statistical models for the correction of bias. Various statistical models
for dealing with test bias have been proposed by Cleary, Cole,
Darlington, McNemar, Thorndike, and others too numerous to
mention here. The entire Spring 1976 issue of Journal of Educational
Measurement was a special issue, On Bias in Selection. In that issue
the Novick and Lindley utility model is described by Novick and
Petersen. Cleary's model was referred to briefly earlier in this
paper. Cole's model suggests that if both a member of the majority
group and a member of the minority group could succeed if selected,
any procedure is unfair that does not present each with the same
probability of being selected. It requires that different predictor
cut-off points be chosen for each group. Darlington's model employs a
single correction factor whose variable weight, determined by a set
of factors important to the selecting institution, would be added
to the criterion scores of the lower-scoring group. McNemar's model
employs a regression equation based on the groups combined, with
group membership included as a predictor. Thorndike suggested
that the percentage of an applicant group to be selected be the same
as the empirically determined base-rate of success for that group.
These models are probably part of the necessary groundwork for a
temporary solution of the problem of bias within the present
framework of inequality of opportunity. Many of these models, however,
are in conflict with each other in one or more respects, and it may
be a long time before one is developed that wins the widespread
acceptance needed to put it into general practice.
Something should be said here, too, about the various attempts
over the years to build "culture-free," "culture-specific," and
"culture-fair" tests. These usually refer to so-called tests of
intelligence rather than to tests of achievement, but are sometimes
suggested as replacements for standardized achievement tests. I think
that there is general agreement that it is virtually impossible to build
a culture-free test; no group lives in a cultural vacuum. Culture-fair
tests might fit some of the models for correcting bias at the item
construction or the item distribution level. Nonverbal culture-fair tests,
as Ornstein points out, generally fail to reflect the full range of a
child's mental abilities. Moreover, the child who has trouble with
verbal tasks generally has trouble dealing with such perceptual tasks
as classification, selection, and arrangement. As for the
culture-specific Black Intelligence Test of Cultural Homogeneity (BITCH),
it has been criticized by Ornstein and others as measuring a very
limited amount of special information useful for functioning in the
ghetto. The ability to label, categorize, conceptualize, and solve
problems - an ability important for all children if they are to succeed
in school - is not dealt with.

Another problem that further complicates the already complex
task of constructing a model for correction of bias or building a test
controlled for bias is the fact that there are in the United States a
great many minority cultures, some of which account for only a
fraction of one percent of the population. Even among the larger
cultural minorities there are differences within groups. The
Spanish-speaking child of Puerto Rican parents, for example, is different
from the Spanish-speaking child just this side of the Mexican border.
There are comparable differences between the various Asian groups.
If we try to assign everyone to a clearly defined group, there will
be too many groups, most of them with relatively small numbers,
to yield any meaningful analyses. If we establish only a few major
groups, we may not improve the situation very much. Moreover,
there appears to be considerable evidence that the differences
between socioeconomic status groups within a culture are much larger
than the differences between cultural groups as a whole.
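
To make one of the statistical models above concrete, the rule attributed to Thorndike (select from each applicant group the same percentage as that group's base rate of success) can be sketched in a few lines. This is only an illustration of the rule as it is worded in this paper, not the published model itself; the applicant records, group labels, base rates, and the function name select_by_base_rate are all hypothetical.

```python
def select_by_base_rate(applicants, base_rates):
    """Pick the top scorers in each group, with the group's quota set by its base rate.

    applicants: list of (name, group, score) tuples (hypothetical records).
    base_rates: dict mapping group -> fraction of that group expected to succeed.
    """
    selected = []
    for group, rate in base_rates.items():
        # Rank this group's applicants by predictor score, highest first.
        members = sorted(
            (a for a in applicants if a[1] == group),
            key=lambda a: a[2],
            reverse=True,
        )
        # Quota: the same percentage of the group as its base rate of success.
        quota = int(rate * len(members) + 0.5)
        selected.extend(name for name, _, _ in members[:quota])
    return selected


# Hypothetical data: two groups of four applicants each.
applicants = [
    ("a1", "A", 82), ("a2", "A", 75), ("a3", "A", 64), ("a4", "A", 58),
    ("b1", "B", 70), ("b2", "B", 61), ("b3", "B", 55), ("b4", "B", 49),
]
base_rates = {"A": 0.50, "B": 0.25}  # assumed, not empirical
chosen = select_by_base_rate(applicants, base_rates)
print(chosen)
```

Note that under this rule the effective cut-off score differs by group, which is one of the respects in which the models conflict with one another.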

Improving the Tests Themselves

While we cannot hope to fully eradicate systematic inequities in test 
performance until inequities in opportunity have been eradicated, 
there are many ways in which tests can be and are being improved.

1. A number of publishers have undertaken a reexamination of items
in existing tests, with the assistance of qualified black and other
minority group reviewers. Items with obvious language or content
bias are being edited or replaced wherever possible, and
specifications for items for new forms or new tests are being written with
concern for possible bias. Tests are also being reviewed for sex bias.

2. Biographical data and other self-reported descriptive information
are being used increasingly in combination with cognitive
measurement for self-assessment and future planning as well as for improved
prediction.

3. Work on adaptive testing, tailored to individual ability level and
other characteristics, is making progress.

4. Advances in computer capabilities have made possible comparable
advances in testing techniques such as branching and the provision
of immediate feedback from the computer.

5. Criterion-referenced tests enable us to determine to what degree
an individual has mastered a particular skill or content area rather
than how that individual compares with others, thus eliminating
the kinds of objections that are made to norm-referenced testing.
Ironically but understandably, however, some publishers of
criterion-referenced tests are being asked to supply norms as a kind
of reference point for the interpretation of the criterion-referenced
scores. Such normative data should be acceptable to all concerned
if it involves group rather than individual comparison. Schools want
to know whether a given average score indicates strength or
weakness in the domain measured, and group norms give them a picture
of relative strengths and weaknesses. The danger, however, as
Popham points out, is that users of criterion-referenced tests will
rely on normative data as a determiner of performance standards.

6. Progress has also been made in diagnostic testing and evaluation.
Expanded computer capabilities have made possible detailed and
highly sophisticated item analysis for local and special groups and
for individual students. Growth curves in specific skills can be drawn
by the computer. The effects of various kinds of interventions can be
analyzed along a number of dimensions.

7. There has been a growing trend toward the use of tests for
placement and classification, as opposed to selection, and a growing
emphasis on decision-making skills that will help individuals use
data from tests and other sources to make for themselves many of
the decisions that have traditionally been the responsibility of the
school, the employer, or other institutions.
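
The distinction drawn in item 5 between criterion-referenced and norm-referenced interpretation can be illustrated with the same raw score read two ways. The following is a minimal sketch under stated assumptions: the 80 percent mastery cutoff, the norm-group scores, and both function names are hypothetical and are not taken from any published test.

```python
def mastery_status(items_correct, items_total, cutoff=0.80):
    """Criterion-referenced reading: has the pupil reached the mastery cutoff?"""
    return items_correct / items_total >= cutoff


def percentile_rank(score, norm_scores):
    """Norm-referenced reading: percent of the norm group scoring below this score."""
    below = sum(1 for s in norm_scores if s < score)
    return 100.0 * below / len(norm_scores)


# Hypothetical norm group of ten pupils on a 20-item test.
norm_group = [12, 14, 15, 15, 16, 17, 18, 18, 19, 20]

print(mastery_status(17, 20))           # judged against the content standard
print(percentile_rank(17, norm_group))  # judged against other pupils
```

The first function never looks at other pupils' scores, and the second never looks at the content standard; Popham's warning concerns using the second kind of figure to set the cutoff in the first.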

These developments are encouraging, but there are still unfulfilled
needs to be met. Some have been described by Gordon, Mercer, and
others. We need measures that will provide information about a much
wider range of abilities and characteristics than present measurement
provides - measures of vocational, social, and interpersonal
competencies; of creativity, which we have not so far even defined
successfully; of cognitive style, or how the individual processes information
and generates responses. We need to know how best to weigh all the
information we have about an individual in order to enable him or her
to make the best possible decisions. We need to find ways to solve the
dilemma posed by prediction based on the past that works to
perpetuate the status quo. We need item analysis programs that enable us to
look at the incorrect choices children mark on tests to see whether an
individual or group pattern emerges that might be of diagnostic
significance. These are only some of the needs. The list is virtually endless.
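
The kind of item analysis program called for above, one that looks at which incorrect choices are marked, might begin with a simple tally of wrong options per item. This is a rough sketch only; the answer key, the response sheets, and the function name distractor_counts are hypothetical.

```python
from collections import Counter


def distractor_counts(responses, answer_key):
    """Tally, for each item, how often each incorrect option was marked.

    responses: list of dicts mapping item id -> option marked (hypothetical data).
    answer_key: dict mapping item id -> keyed (correct) option.
    """
    counts = {item: Counter() for item in answer_key}
    for sheet in responses:
        for item, marked in sheet.items():
            # Only incorrect choices are of interest here.
            if marked != answer_key[item]:
                counts[item][marked] += 1
    return counts


answer_key = {"q1": "d", "q2": "b"}
responses = [
    {"q1": "d", "q2": "a"},
    {"q1": "a", "q2": "a"},
    {"q1": "a", "q2": "b"},
    {"q1": "c", "q2": "a"},
]
wrong = distractor_counts(responses, answer_key)
for item, tally in wrong.items():
    print(item, tally.most_common())
```

Running the same tally separately for each group of interest and comparing the most frequently chosen wrong options is one way a group pattern could surface.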

Improving the Use of Tests 

No matter how much we improve the quality and sensitivity of our
tests, we will have gained little if the way in which they are selected and
used is not also improved. This must be a joint responsibility of both test
publishers and the institutions using the tests, with the publisher
providing the interpretive material, descriptive information about the
test and its purposes, and suggestions for use, and the institutions
providing the necessary training in test use, possibly with help from
the publisher.

A school testing program, for example, should be based on the joint
decisions of those who will have to implement it and interpret the
results. This means involving counselors and teachers, or at least
representatives from among them, in addition to the school principal
or the superintendent of schools and anyone else who will play a major
role in the program.

Questions to be discussed by these individuals include:

1. What is the purpose of the testing program? What is it the school 
needs to know, and which tests can help supply the answers? 

2. Do the tests under consideration fit the intended purposes of the
program? That is, do the tests measure the traits or content areas or
attitudes that the school wants to know about? Technical manuals
and interpretive information should provide answers to this
question.

3. Does the content of subject-matter tests - whether norm-referenced
or criterion-referenced - match, in general, what students have
been exposed to in their course work?

4. Is the reading level such that most students can be expected to
understand the language of the test?

5. Are the directions to the students clear so that the average student
will not have difficulty following them?

6. Can the results be used for diagnosis of specific difficulties as well
as for general measures of achievement, ability, and so on?

7. Are there hidden biases, such as overall content slanted to white
middle-class values and culture, or to traditional sex role behavior?

8. For standardized tests, are the norms provided generally useful for
the particular school population? If not, are local or other
appropriate norms available, or is information provided that will suggest
how to interpret the results for the students?

9. How does the school plan to use the test results to help students?
How will the results be communicated to students and their parents?

10. Do the tests meet the essential requirements of the APA Standards
for Psychological Tests and Manuals? What does Buros' Mental
Measurements Yearbook say about them?

Parents and students should also be told something about the testing 
program, and why it is being given. Some publishers have prepared 
letters to parents for this purpose. If these are not available, the school 
should prepare its own. If student information booklets containing a 
description of the test and sample items are available, they can be used
in a brief test orientation session with students, to put them at greater
ease in the testing situation. Filling in sample answer sheet grids well 
ahead of the testing date also helps reduce irrelevant sources of error on 
the test itself. 

When test results are available, all who will be involved in the
interpretation should be briefed on the results and what they mean. Report
forms, profiles, bands of confidence, the meaning of percentiles, the
differences between measures of ability or achievement and measures
of interest - all these should be understood by teachers and counselors
before the results are disseminated. The school might also want to
consider involving parents and students, especially students at the high
school level, at some point. Parents will want to know what the results
mean for the child. What new information has the test added to what is
already known about the child? Are there contradictions between the
test results and other information? If so, how can they be explained?
Finally, both parents and students will need reassurance that test
results will be used constructively, that a low score on a reading test
means, usually, only that the child needs help with reading.



Conclusion 

I hope I have succeeded in demonstrating that, although the baby and
the bath water are still with us, the bath water is much cleaner now
than it has been for a long time. And a lot of effort is going into making
it still cleaner.

I'd like to close with a quote from Theodore Sizer's conclusion at
the ETS Conference on Testing Problems six years ago:

"...the testing fraternity needs to concentrate on the effects of class,
race, and ethnicity on the development of skills and attitudes. It needs
to help us understand how these factors influence human development
over time. It needs to suggest ways of lessening those influences that
narrow a youngster's options, and ways of measuring the child's
progress in increasing his options.

"Testing must not in a benign way serve as a device to preserve the
social status quo. On the contrary, it must be used to illumine current
social rigidities - and to help us finally break out of them."
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One Man's View 
of Testing

William Raspberry
Columnist
The Washington Post

If I occasionally find myself rebutting attacks on standardized tests, it
is not because I think the tests are that great. It is because I think they
are often attacked for the wrong reasons.

I am thinking, for example, of the attacks premised on the fact that
blacks and other disadvantaged minorities do less well on standardized
tests than do middle-class white children.

I am thinking of the blitz of the National Elementary Principal
magazine [Vol. 54, No. 6, July-August, 1975] which, in a single issue, devoted
18 articles and an editorial to the subject of standardized testing and
managed to find not one single good thing to say about it.

I am thinking of the assaults by people who have a vested interest in
my not finding out how well, or poorly, the schools are doing in
their primary job of educating children.

I am thinking of people who scream cultural bias without the faintest
idea of what they mean.

I am thinking of people whose objection is to policies, but whose
attack is on tests designed to effectuate those policies. They denounce
screening of fully qualified applicants to graduate school, for instance,
simply because there are fewer spaces than applicants.

And so, although I happen to believe that the test makers are not
doing nearly a good enough job of devising tests or helping those who
administer them to understand their proper use, I frequently find
myself opposing those who attack testing.

I found myself in verbal combat with the former superintendent of
schools in Washington, D.C., Mrs. Barbara Sizemore, when, after
recently published scores showed that our children were performing
poorly, she proposed an end to testing.

I pointed out that there might have been all sorts of reasons why it
would be unfair to compare test scores of children in Washington slums
with those of children in Palo Alto. But it did seem to me, I said, that
some other explanation was called for when the results showed that
Washington children were doing less well in reading and math than
Washington children had done the year before and the year before that.
When tests reveal trends, I said, it seems they are trying to tell us
something. As a rule, I count it better to listen than to throw the tests away.

True, there were problems with the tests. As Mrs. Sizemore pointed
out, there is no assurance that you are testing the same children from
one year to the next. Nor, without some attempt to chart migration
patterns and changes in the socioeconomic patterns of the student
population, can one assume that test results reflect what happens in the
schools.

But whatever is wrong with the tests, there are some things they can
do. They can tell, within limits, how the children in your hometown
stack up scholastically with the children across town or across the
country. And they can tell you how the children in your schools stack up with
their predecessors in those same schools, or what happens to a
particular class of students during its school career.

These are things worth knowing. But some people do not want us to
know them. That was my suspicion when I read the Elementary
Principal magazine I mentioned earlier. Standardized tests, a dozen and a
half authors concluded, "destroy" children. The tests, they said, are
illogical, misleading, and may inspire cheating. Comparing people to
one another along a single scale of ability is fundamentally demeaning
and unfair.

Not only do the tests do badly what they allege to do; what they
allege to do should not be done in the first place. "...standardized
science achievement tests for the elementary school are almost uniformly
poor in quality. They are incorrect, misleading, skewed in emphasis,
and irrelevant."

"The scores purport to be measures of the educational health of a
community or a school. But in fact it would make as much sense to
take the blood pressure of each student, apply the usual statistical
procedures, and publish the results district by district, to measure the
health of the student body."

The articles took the usual potshots at individual test items, many of
which are incredibly bad, and often concluded that any test containing
such items is worse than useless. For instance, this multiple-choice
item:
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Many kinds of plants are not able to live in the 
desert because of the 

high temperature 
low rainfall 
bright sunlight 
poor soil. 

Any scorer who marked any one of those test choices wrong ought to be
fired. But some of the attackers took exception to virtually all
multiple-choice questions, including this one:

What do scientists use to make small things appear larger?

a barometer 
litmus paper 
a balance 
a microscope. 

I thought it was a fairly unambiguous item. But the author who cited it
had this comment: "What are small things? small differences in
pressure? small changes in acidity? small weights?"

In other words, according to the author, this is yet another
ambiguous item. Not to me. If I gave the question to a science student
and asked him to come up with a justification for each possible answer,
that would be one thing. But if I gave him the question, and made him
understand that there was only one acceptable response, and that he
would have no opportunity later on for justifying exotic answers, that
would be another thing altogether. In that case, if I asked him what
instrument made small things appear larger, and he said barometer, I
would think he either was not very bright or that he was being a smart
aleck.

One point is made again and again: norm-referenced standardized
tests do not tell you what to do about low achievement, nor do they
prescribe remedies. My response is that the instrument panel on the
dashboard of my car includes a speedometer, which does not tell me
why my car is not going faster; a temperature gauge, which does not tell
me why it is running hot; and a clock, which does not tell me why I am
late and what alternate routes I might take to make up the time. But it
does not follow that these are useless instruments. Sometimes it is very
helpful to know something is wrong.

Now, what of cultural bias? Well, it depends on what you are talking
about.

Everybody knows what cultural bias does: it causes inner city blacks
and other disadvantaged children to score poorly on intelligence,
aptitude, and achievement tests. But hardly anybody seems to know exactly
what it is, or whether it is a correctable condition. And as a result, there
is a growing demand for its solution by radical surgery: get rid of the
tests. Unfortunately, those who tend to be most victimized by cultural
bias, whatever it is, are least in a position to dictate an end to testing.

There are two main propositions concerning testing and cultural
bias. They often coexist in the same argument and occasionally are
commingled in the same sentence.

The first is: standardized tests, because of cultural bias, do not
accurately measure the capabilities of black and other minority test-takers.
The second is: standardized tests may be a more or less accurate way of
testing capabilities (though not native intelligence) but, because of
cultural bias, they test those capabilities at which the middle-class and
white, rather than the poor and black, tend to excel.

The first says tests do not do very well what they allege to do, as far as
blacks and minorities are concerned. The second says the tests are
designed to uncover virtues which the dominant society deems
important and not others which it considers less important. One of the things
that is rated important is the degree to which applicants have absorbed
and internalized the dominant culture, including those things normally
taught in schools.

When you put it that way, it becomes clear that the test is supposed to
be culturally biased. That is one of its purposes. It might not tell you
whether the learning, the acculturation, took place in school or at home.
It will not measure those aptitudes and achievements that the
test-designer was not looking for, and it will not tell you anything definitive
about native ability.

But if your purpose is to know how much of "A" a child has absorbed
in order to know when to proceed to teach "B", standardized tests can
be useful - unless, of course, proposition one is true, in which case the
test will tend to undermeasure the knowledge of black children.

Because of the confusion over what cultural bias is, attempts to
remedy it have shot off in a number of directions. Some have attacked tests
that purport to measure reasoning ability as being, in reality, tests of
socioeconomic status and vocabulary. Take, for instance: Candelabra is
to candle as chandelier is to (a) book, (b) Ben Hur, (c) light bulb, (d)
elaborate. Some critics would see this as an obvious example of cultural
bias. What does a kid from the ghettos or the barrios know about
chandeliers and candelabra? They might recast the question to read: Finger
is to wrist as toe is to (a) elbow, (b) foot, (c) tap dance, (d) ankle. And
they might be stunned to see their ghetto youngsters miss that one, too.
For it is beginning to appear, to me at least, that the cultural bias in
tests is not always in the vocabulary and content but in the form. The
bias may be in the test question as a device for uncovering reasoning
ability.

Robert Williams' BITCH test (Black Intelligence Test of Cultural
Homogeneity) sidesteps the problem by testing for vocabulary only.
And since the vocabulary is based almost exclusively on ghetto usages,
Dr. Williams' test also produces higher scores for blacks than for
whites - a sort of reverse cultural bias.

But not really. Vocabulary testing is too limited a solution; it does not
tell us enough. Dr. Williams says he is working on tests that will do for
questions in logic what the BITCH test has done for questions in
vocabulary. He did not say what he will call this second-generation test, and I
did not ask.

No matter. The solution is not to come up with cute things that
reverse the usual black-white scoring patterns. The solution is to do
what we can to give poor black and other minority children the sort of
background and support knowledge that have currency in the country,
to increase their opportunities for escaping the crippling effects of
poverty, and to help them pass tests.

Meanwhile, there are a few things I'd like to talk to test makers about.
I would like to hear them explain the necessity of distributing
populations. My understanding is that one of the requirements of
standardized tests is that they distribute the tested groups into
bell-shaped curves. They are very clever at doing that, at making each
individual test item do that. But I am not sure I understand the point
of it.

What would seem to me to make more sense is to devise tests calculated
to determine how much of what is to be taught has in fact already
been learned. That way, you would not have to throw out an item just
because too many people got it right. You would have a device that
tested children against the course material, rather than against each
other. Comparisons would still be possible, of course, but that would
not be the whole point.

It strikes me as particularly pointless to construct tests, those
bell-shaped monsters, for graduate record exams, medical aptitude exams,
and LSATs, because much of the weeding-out process already has been
accomplished.



I would like to see the test makers and test users agree to try to come
up with an instrument that is capable of establishing a cut-off point
below which success would not be predicted but which would make no
effort to rank those who score above the cut-off.

And I would like to see the test makers do a much better job than
they have done in increasing public understanding as to just what their
tests are supposed to do.

The Student and Testing



Thelma T. Daley

Career Education Specialist

Board of Education of Baltimore County, Maryland

As an organizational person, I have been in plenary sessions when the
quiet and seemingly perfunctory motions, sometimes bordering on
being soporific, have erupted as forcefully as a supposedly sleeping
volcano suddenly found belching and emitting tons of frightening
lava. Likened unto the volcanic action has been the move to place a
moratorium (a five-year moratorium, a one-year moratorium, an
indefinite moratorium) on all tests; standardized tests, that is. I have
witnessed the widespread debate on the various issues concerning
testing. I have heard columnists, commentators, journalists, experts,
and neo-experts on the subject.

The great debate continues, and although the voice of the student
may not be seen or headlined as one of the great debaters, the issues are
irrelevant unless they relate to the human test taker, the student. In
fact, I wonder why the widespread debates seldom see students as
debaters; a search of records does not reveal a moratorium called by
students.

In an educational era of accountability, students may read about
their achievement (or lack of achievement) in the major newspapers
almost on a daily basis or may hear their collective performance
discussed over the local television channels. A typical example is the
front page story in the Tuesday, October 19, 1976 edition of the News
American (Baltimore), "Students Still Lag in Tests," which in part
states that "pupils' scores on standardized tests of basic reading and
math skills showed some improvement last year, but average scores for
three of four grades tested remained in the bottom 30 percent of a
national sampling."

Tests are designed, manufactured, and distributed for takers. Not all
takers of tests are students, nor do all students necessarily take tests.
However, it has been stated that, as a nation, we administer over 200
million achievement tests each year. This figure represents only about
65 percent of all educational and psychological testing that is carried
out.

In earlier treatises, principals reported that some standardized tests
were given in their schools each year. In 1961, Goslin reported that 100
million ability tests a year were being taken by persons in educational
institutions. Later, in 1964, it was estimated that 150 million to 250
million tests a year were being administered. Of 714 elementary school
principals in the Russell Sage Report, only one reported that his school
had not had plans to initiate a standardized testing program.

In addition, the Coleman survey reported that over 90 percent of the
nation's pupils were in schools where intelligence and achievement
tests were given at both the elementary and secondary levels.

Besides intensive testing programs, external testing programs, such
as the Preliminary Scholastic Aptitude Test (PSAT), the National Merit
Scholarship Qualifying Test (NMSQT), the Admissions Testing Program
(ATP) of the College Entrance Examination Board (CEEB) and
the American College Testing Program (ACT), Armed Services
Vocational Aptitude Battery (ASVAB), the General Aptitude Test
Battery (GATB), Civil Service Examination, Betty Crocker Search for
Leadership and Family Living, all add to the number of tests
administered in the school each year. This does not take into account
the tests given at midterm, at the end of a unit, or the test given
vindictively as a disciplinary measure.

With the increasing number of tests and the growing quest for the
raison d'etre by students, one must have available the why for the test
and the proposed use of the results. Typical uses (though many times
given in a circuitous, incomprehensible way) may be (1) to select for
college admission, (2) to group, (3) to identify needs, (4) to help
students select courses, (5) to aid in career planning, (6) to evaluate
programs, and (7) to provide information which might be helpful in
securing facilities, gaining new resources, and providing research data.

The student cares very little about the research data, the
accountability schemes, the evaluation of programs. The student does
care if he/she can visualize immediate, concrete, relevant uses.

I have witnessed large testing sessions in school auditoriums with lap
boards serving as improvised desks, very poorly defined test goals
(other than that the test was required of all tenth graders and it could
be used to predict the next levels of achievement), and students who
could not care less. Students quickly exhibited their displeasure
nonverbally by rapidly running through items timed for 20 minutes in
less than 10 minutes and spending the remainder of the time buckling
lap boards, while the top 2 percent studiously raced against the ticking
stopwatch, and proctors silently but forcefully moved from row to row
with bucklings commencing as fast as a buckler was silenced. Students
today do not view massive general achievement or general scholastic
aptitude testing as relevant to their needs. They will quickly ask,
"What good is that to me?"

However, in an era of accountability, legislatively mandated, systems
that had all but abandoned statewide or unit-wide testing programs
have once again enacted them. In my own state's accountability
program, the implementation plan required the establishment of a
comprehensive and uniform statewide testing program. The Iowa Test of
Basic Skills (ITBS) and the Cognitive Abilities Test (CAT) were
selected as the statewide assessment instruments. Since the spring of
1974, all pupils in grades 3, 5, 7, and 9 have been tested on the ITBS
and the Nonverbal battery of CAT, and the Maryland Basic Skills Reading
Mastery Test has been assigned to grades 7 and 11 as of the fall of
1975-76.

It is hoped that, as an important aspect of this assessment fabric, the
results will provide teachers and schools with a basis for improving the
quality of their efforts on behalf of the students. However, the capacity
of a system to generate data is usually greater than the capacity of
teachers to use the data.

Let me advance to some nontechnical aspects of testing that very
much affect the student.



The Administrator— The Interpreter 

The person who administers the test may have a negative effect on the
examinees or the students. Sacks found that students' scores increased
if a good examinee-examiner relationship was established prior to
a test.

Some writers, such as Fadilla and Gazda, suggest that the examiner
can maximize or minimize the child's performance (on an individual
test) by his or her actions. Similarly, by misinterpreting the child's
responses, the examiner can significantly raise or lower the final
individual intelligence (IQ) score.

In mass testing, such as accountability testing, many times teachers
are examiners who have never been involved before and who may not
have gone through a full orientation. Lack of knowledge and general
information on the part of the examiner is ultimately detrimental to the
student. Although I have no data to prove it, many teachers approach
testing sessions with very negative attitudes. In fact, when the notice
arrives that a SCAT Test will be given by all English teachers, many
are heard to exclaim, "Those things! There goes my good English
period." This attitude is bound to be reflected in the students'
perception.

Somewhere along the line, test administrators and test interpreters
must meet minimum standards. This is recommended for standardized
tests; however, I would go a step further and recommend that all
teachers be inserviced in test making, test taking, test administration,
and test interpretation.

In the McCarthy study, third and fourth grade children who wrote
a composition on "The Best Thing That Ever Happened To Me," prior
to a test, averaged four to five points higher than their scores on the
same test taken after writing on "The Worst Thing That Ever Happened
To Me." Tyler showed that an examinee's experience immediately
preceding a test affected his/her test performance. Kirkland stated
that a "warm" versus a "cold" interpersonal relation, or a rigid and
aloof relation versus a natural manner on the part of the examiner, may
affect the examinee's responses. So, in fairness to the student, the
examiner-interpreter approach must be addressed.



The Student and The Purpose 

We test for many, many reasons; however, the student deserves to know
why the test. ACT and SAT are popular because the purpose is clearly
defined and understandable (not necessarily acceptable) to students.
The PSAT/NMSQT purposes are understood but become a major
disappointment to students when the financial aspects peter out for the
majority. Many are led to take the test with the hope that scholarships
might be at the end of the rainbow, only to find out that the rainbow
never appears.

There is considerable anxiety and tension associated with the taking
of tests. In my counseling experiences, I witnessed students who have
literally become ill on test days. Some of these same students faint and
become hysterical on report-card days.

There are many hidden reasons why tests are given. Sometimes the
scores are used to rule on eligibility for a basketball team; another
time they might mean meeting graduation or grade level requirements;
they may mean entry into the Armed Forces, a job, or acceptance at the
college of one's choice; or they might mean remedial or prescriptive
work based on the diagnosis. Whatever the purpose, the student should
be fully apprised. If the test is to satisfy parental pressures, this too
should be clearly displayed.

If a test is given for diagnosis, the student should see the results via
a developmental program. As an example, in Maryland's accountability
program, reading test results are being used to select schools to receive
special assistance. This year a special project, called Project STAR
(Standards Technical Assistance Resources), is in operation with 11
elementary schools showing a need in the reading area. Four specialists
in the areas of reading, language development, guidance, and
community involvement, plus a STAR resource teacher in each of the
schools, are working with the local staff to assess their current reading
programs and develop a plan for upgrading student proficiency to state
standards. Integral to the project is a monitoring and evaluation system
to measure growth of student achievement and staff development.

Neulinger, in looking at attitudes of American secondary school
students toward the use of tests, found that anti-test sentiment is
neither ubiquitous nor consistent. His data showed that not every
student, or every group of students to whom we administer a test, holds
negative opinions about testing. His findings did indicate, however,
that a student is quite likely to be inconsistent in his or her attitude
toward testing. One may favor testing in one context and disapprove of
it in another. Neulinger found that students' attitudes toward testing
were related to social background and personality characteristics. He
interpreted his findings to indicate that a student who is a member of
the lower class, from a less well-educated background, who is less bright
and knows it, who has limited aspirations and views of the world in
fatalistic terms, reacts to tests quite differently from the respondent
who is from a better educated background, who is bright and knows it,
has set high goals, and thinks the world will conform to his or her
wishes. For the upper class respondent, tests helped him or her to
identify as a member of the elite. Tests were instrumental in getting
the student into the better schools.

The student in the lower socioeconomic and less educated domain
saw the test as identifying him or her but not as a member of the elite.
The identification was the equivalent of being degraded. The school,
which is supposed to upgrade his or her abilities (as students see it),
condemns the student before he or she gets a chance. The test excludes
him or her from places of higher learning.

Neulinger concluded that students saw tests as being used by society
as a tool to differentiate among people in ways that have real
consequences. Only to the degree that society is fair and just in making
these discriminations will people agree that it is fair and just to use
tests.

Kirkland, in a study on the effects of tests on students, stressed that
the student was the one whose status in school and society is
determined by test scores and the one whose self-image, motivation, and
aspirations are influenced. Tests do affect the self-concept of students,
but it is important to note that the way a person views himself/herself
also influences test behavior.

In terms of motivation, exactly how students are motivated by tests
has not yet been conclusively demonstrated. Most findings indicate
that feedback from tests promotes learning, assuming that the student
attempts to do well on the test. Students with negative scores detest the
frequent feedback, which tends to increase the level of low motivation.

Level of aspiration seems highly related to self-concept and
motivation. Moss and Kagan, in their longitudinal study of intellectual
progress and achievement, concluded that the child who attains
scholastic honors is rewarded by those around him and that this
experience frequently leads to an expectancy of future success for
similar behavior, thus increasing the probability that the child will
continue in such tasks. Failure would result in the opposite behavior,
such as avoidance or withdrawal.

As individuals meet with success, their goals and aspirations rise in
accordance with their increased confidence. Students who were tested
most often and best informed about their performance were the ones
most motivated to acquire additional information.

Anxiety is another big issue with students. There is considerable
tension and anxiety about taking tests. Findings have indicated that
anxiety scores correlate negatively with IQ and achievement for the
so-called middle and low IQ groups.

The Student and the Testing Environment

Most tests are given in such ungodly places as the cafeteria, with hinged
backless bench seats amid the rattling of huge metal vats and the
aromatic odors of near-done meats, cooling desserts, and boiling soups,
the day's menu. Many are given in dimly lit auditoriums, and
occasionally the gym is readied with chairs and proctors. Long time
limits and the absence of independent divisions within the test
sometimes make it impossible to administer it in the available site.
Although at this writing the College Board's Blue Ribbon Panel has not
aired its reasons for the score decline, I feel almost assured that
testing environments may play a role.

The Student and the Score

In a recent report, Roy Forbes, Director of the National Assessment of
Educational Progress, has pointed out that no testing instrument
reveals everything about the quality of education students are
receiving. I can recall the nervousness, the excitement, the hugs, the
tears, the absolute look of failure when students have received test
scores. In a quote by a student cited by Cottle, a young man who had
just learned of his performance on a set of standardized achievement
tests said, "If you eliminated money in our society, you could eliminate
tests and all the test scores." He said, "So, to be American means that
you have a lot of money. No matter what you earn, you aren't satisfied
until you have more than the next guy. That's the same thing with tests.
Giving us our score isn't enough; they have to give us the percentile
rank as well. Nobody's supposed to get 690 and think they're really
special. The counselor tells them right away that 690 may sound good,
but it's only the 80th percentile. You have got to have money and you
have got to have IQ, PSAT, and SAT points. Americans: numbers and
quantities. Big is the name of the game. Produce and get bigger. Inches,
pounds, dollars, points on tests, are all anybody cares about, even the
minority students in our school. Nobody asks whether they are happy.
All people want to know is whether their achievement scores have gone
up, or how many points they scored in a basketball game."

The social consequences of the score have become a new area of
interest in this decade. The issues, according to Ebel, center around
such social consequences of testing as:

1. They may place an indelible stamp of inferior intellectual status on
a child, ruin his/her self-esteem and educational motivation, and
determine his/her social status as an adult.

2. They may foster a narrow conception of ability and reduce the
diversity of talent available to schools and society.

3. They may place education and the destinies of individual human
beings under the control of test makers.

4. They may encourage impersonal, inflexible, mechanistic processes
of evaluation and determination.

For the student, in too many cases, there is finality in the test score.
The student is marked, grouped, or tagged. The score is placed on the
cumulative record card, and teacher after teacher reaffirms his/her
belief in the student's ineptness; later, employers see the score and
readily accept that a retardate is applying; and parents get the cold
score and quietly brood, wondering what they did wrong in prenatal
care, and subsequently develop guilt feelings. Consequently, they
overindulge the child and, ultimately, foster negative behavior. One
little score goes a long, long way.

Folds and Gazda found that individual test interpretation, small
group test interpretation, or written test interpretation resulted in more
accurate self-estimates of test scores than was found in control groups
receiving no information. I contend that accompanying every test must
be a descriptive supplement dealing with creative, informative, and
positive ways of revealing test scores to parents and students, and also
to teachers who quite often forget the interpretation. Descriptive
transparencies, demographics, and clear language are desirable tools
that counselors and teachers welcome along with test results.

The State of Maryland has developed an occasional paper on
accountability entitled Improving Student Attitudes and Skills for Taking
Tests. Among others, the publication stresses that teachers, even the
directors, should know the characteristics of the students, create a
supporting environment, and avoid interruptions. Teachers are encouraged
to prepare students for taking tests: tell them why they are taking the
tests, how the results will be used, and how the tests are scored. They
are urged to train students how to take tests, to teach them the specific
thinking skills required on tests, and to inform them of the teacher's
role during the test. They are urged to simulate test-taking conditions.

I remember very vividly a special education student whose name
was C.V. C.V. was a talented, tall, black, restless male who played the
guitar, the drums, and sang. C.V. hated, literally hated, his special
education classes, but tests said that was where he belonged. C.V. defied
the scores and roamed the halls, devised ways of avoiding the teachers on
hall duty, slipped to the shop to create, slipped to the music room to
syncopate, slipped to the art room to watercolor, and slipped to the
gym to make two points. C.V. finally slipped out of sight because his
tests labeled him special, labeled him dumb.

In the words of the NAACP Report on Minority Testing, tests must
predict accurately what they promise, tests must measure adequately
the content of the area they purport to cover, and the testing program
must be capable of leading to prescriptions which result in positive
growth for the persons (the students) being tested.

These are my reflections on testing and the student.

Where Ignorance Is Bliss,
'Tis Folly to Be Testing

Robert L. Thorndike

Professor Emeritus of Psychology and Education
Columbia University

There may have been a few of you who, when reading the title for my
remarks, bristled slightly. "Who does this joker think he is," you may
have said, "equating testing with being wise?" And I can't agree with
you more! But the fault really lies in the old adage because the
antithesis of being ignorant is to be informed, not to be wise. Wisdom,
like beauty, lies in the eye (or the cortex) of the beholder.

To be informed sounds, on the face of it, desirable, like baseball,
apple pie, and Chevrolet. But we need to ask: To what end does it profit
us to be informed? And the uniform answer, it seems to me, is that we
wish to be informed so that, being informed, we can make better
decisions. Some folks may treasure information for its own sake, as
others treasure bits of string, match books, or rubber bands, but to most
of us the fundamental value in being informed lies in the decisions that
can be based on that information.

If that be so, our basic problem as makers of tests, peddlers of tests,
or instructors in the use of tests is twofold. It is, first, to determine
what information is useful in relation to what types of decisions and
then to make it possible for the decider to get that information. It is,
second, to try to bridge the gap from information to wisdom so that the
relevant information is used with perceptiveness and restraint to lead
to decisions that will foster growth, success, and happiness for the
individuals or groups concerned. My remarks today will be directed
primarily to the first of these two problems, with the hope that there
may be a little spinoff on the second.

It is important to recognize that there are a number of different types
of decisions for which the information provided by testing may be
relevant, and that the information needed for one type is likely to be
quite different from that needed for another. The information needed
by a teacher deciding whether to review capitalization of place names is
of quite a different sort from that needed by a twelfth grader trying to
decide whether to apply for admission to Harvard. I would like to
review some of these types with you and comment on the sort of
information, and consequently the type of testing, especially
achievement testing, that seems most appropriate for each.



First, let us turn our attention to instructional decisions. These are
decisions, usually by the classroom teacher, of the type: "Mary knows
what a prime number is. We don't need to teach her that, and can start
her in on factoring." Or "Willie can't tell a complete sentence from a
fragment. He needs help on this." A sound decision on whether to teach
or not to teach topic B depends, in part at least, on information as to
whether a student (or possibly, most of the students in a group) has
mastery: first, of topic B itself, and second, of topics A1, A2, and so on
that provide the foundations for topic B. If the student (or class) has
already achieved a satisfactory level of mastery of B, to spend
additional time teaching it seems a waste. On the other hand, if the
student or class that cannot do B does not have command of certain of
the A's, and these particular A's are really essential to learning B, to
plunge into B without first mastering the A's seems likely to be an
exercise in frustration and futility. This was the credo that motivated
the authors who developed tests such as the Compass Diagnostic
Arithmetic Tests back in the 1920s. And a revival of this credo appears
to be what started the wave of enthusiasm for criterion-referenced tests
in the past decade.

To the extent that aspects of the curriculum are sequential, to the
extent that one can identify certain skills or certain bodies of knowledge
that are necessary antecedents to successful study of other skills or
bodies of knowledge, and to the extent that one can define what
constitutes an adequate level of mastery, this approach seems sound. But
I believe that my "to the extent thats" represent very severe constraints
upon the breadth of applicability of the "criterion-referenced"
approach. It is no accident that most of the examples of
criterion-referenced testing are drawn from arithmetic. Arithmetic is
the academic subject that comes closest to comprising a sequential set
of identifiable, discrete skills that can be fully mastered and in which
later skills build upon the foundation of what has previously been
learned. In primary reading, some of the basic word analysis and
decoding skills may have similar status as essential contributors to
fluent reading. And, of course, there are numerous specific rules in
language usage and elsewhere that represent teachable and testable
specifics, even though they are not sequential in the sense that mastery
of any one is essential to the teaching of any other. But much of school
learning deals with material that is neither sequential nor organized in
neat packages that can be fully mastered. What are the boundaries that
define, and what constitutes mastery of, reading comprehension or of the
French Revolution? These represent broad domains (the one of skill, the
other of knowledge) for which prerequisites or successors would be hard
to specify and for which the concept of "mastery" at the 80 or 90 percent
level seems to lose all meaning.

Even with fairly specific and definable skills, setting a standard of
mastery can be a tricky business. Consider the rather precisely defined
objective: When shown a 2-digit number, specifies whether or not it is
a prime number. Relatively few in this room would assert that 25 or 58
are prime numbers, and the few who did would be persons with no
conception at all or gross misconceptions of what a prime number is.
However, my experience with previous groups like this indicates that a
good many of you would unhesitatingly identify 51 or 91 as being prime,
though of course they aren't. Here, as in many other cases, it makes a
world of difference which exemplars one chooses in order to test mastery
of even a sharply delimited skill domain, and for many students, whether
one will or will not conclude that they have achieved mastery in terms
of some specified proportion of successes will depend critically upon the
specific tasks that have been chosen to exemplify the domain.
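
The arithmetic behind the four exemplars can be checked directly. The
short Python sketch below is an illustration only, not part of the
original remarks; the function names and the "last digit" screening rule
are assumptions made here to show why 25 and 58 are easy items while 51
and 91 are hard ones.

```python
def smallest_factor(n):
    """Return the smallest factor of n greater than 1 (n itself if n is prime)."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return n


def is_prime(n):
    """A number is prime when its smallest factor above 1 is the number itself."""
    return n >= 2 and smallest_factor(n) == n


# The four exemplars mentioned in the text.
for n in (25, 58, 51, 91):
    f = smallest_factor(n)
    label = "prime" if is_prime(n) else f"composite: {f} x {n // f}"
    print(n, label)

# Assumed screening rule: a 2-digit composite "looks prime" when its last
# digit does not give it away, i.e. it is divisible by neither 2 nor 5.
hard_exemplars = [
    n for n in range(10, 100)
    if not is_prime(n) and n % 2 != 0 and n % 5 != 0
]
print(hard_exemplars)
```

Under that assumed rule, 25 and 58 are screened out at a glance, whereas
51 (3 x 17) and 91 (7 x 13) remain among the hard exemplars.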

As a minor detour, it seems to me that from a psychometric point of
view assessment of real mastery is most efficiently achieved by using
tasks that represent the more difficult exemplars of the domain, so long
as they do not introduce other and irrelevant sources of difficulty. We
test prime number mastery better with 51 than with 50, mastery of the
basic addition combinations better with 7 + 8 than with 2 + 3. Success
on the easy items tells very little about mastery, though it may signify
a good beginning; success on the hardest items tells a lot.

Another point to be borne in mind is that the performance of a
learner who is just picking up a new competence tends to fluctuate from
day to day and week to week. One doctoral student with whom I
worked, studying foreign students who were learning English and using
a set of mini-tests of specific English usages, found markedly lower
consistency between two tests given only a week apart than within the
items of a single test, the respective reliability coefficients for A-Mcxw
tests being about .80 and .60. Two separated short tests to
appraise mastery should permit a wiser decision than a single test twice
as long.
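
One way to see why this follows is a small variance-components
calculation. The Python sketch below is an illustration under stated
assumptions, not the speaker's own analysis: it takes .80 as the
within-test coefficient and .60 as the week-apart coefficient (the lower
of the two, as the text indicates), and it assumes a classical test
theory model in which a short-test score is the sum of independent
stable, day-to-day, and item-error components. All names are
hypothetical.

```python
def variance_components(r_within, r_between):
    """Split the unit variance of one short test into three assumed parts."""
    stable = r_between                 # shared across days
    day_to_day = r_within - r_between  # shared within a day only
    item_error = 1.0 - r_within        # specific to the items sampled
    return stable, day_to_day, item_error


def stable_share(components, sessions, length_factor):
    """Proportion of averaged-score variance due to the stable component.

    sessions: number of separate testing days averaged together.
    length_factor: length of each session relative to the short test.
    """
    stable, day_to_day, item_error = components
    observed = (stable
                + day_to_day / sessions
                + item_error / (sessions * length_factor))
    return stable / observed


components = variance_components(r_within=0.80, r_between=0.60)
two_short_tests = stable_share(components, sessions=2, length_factor=1)
one_long_test = stable_share(components, sessions=1, length_factor=2)

print(f"two short tests, a week apart: {two_short_tests:.3f}")
print(f"one test twice as long:        {one_long_test:.3f}")
```

Under these assumptions the same total testing time reflects stable
mastery better when it is split across two days, because lengthening a
single test shrinks only the item-error component and leaves the
day-to-day fluctuation untouched.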

Thus, criterion-referenced tests built of the critical examples of a
defined skill, perhaps repeated to check upon stability of mastery, can
in certain limited areas provide the information basic to wise
instructional decisions.

But a wide range of other decisions that arise in the process of
education call for information on the performance of individuals and
of groups. We can distinguish selection decisions, placement decisions,
decisions involving curricular choice and resource allocation, and a
whole set of decisions that we might call guidance decisions or personal
decisions. What sorts of information provided by what sorts of testing
instruments will permit decisions of these kinds to be made more
wisely? Let us turn our attention for a bit to selection decisions.

Implicit in the very concept of selection is a situation in which there
are more aspirants to some particular good, be it admission to a
program in veterinary medicine, a berth on the Dallas Cowboys, or an
executive secretary's job with the president of Widgets International,
than there are positions to be filled. There is the often painful task of
choosing among persons all of whom may be at least minimally
qualified, trying to pick the best, or at least the better qualified, from
among the applicants. The regression of some index of job performance
upon score on a predictor test represents one type of information to
guide such a decision.
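
The regression-based selection just described can be sketched in a few
lines of Python. This is a minimal illustration only: the past scores,
performance ratings, applicant names, and number of openings are all
made-up values, and the function names are hypothetical.

```python
def fit_line(scores, performance):
    """Least-squares regression of performance on test score."""
    n = len(scores)
    mean_x = sum(scores) / n
    mean_y = sum(performance) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, performance))
    sxx = sum((x - mean_x) ** 2 for x in scores)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def select_top(applicants, slope, intercept, openings):
    """Rank applicants by predicted performance and keep the top few."""
    ranked = sorted(
        applicants.items(),
        key=lambda item: slope * item[1] + intercept,
        reverse=True,
    )
    return [name for name, _ in ranked[:openings]]


# Hypothetical data from earlier hires: predictor score and later rating.
past_scores = [40, 50, 60, 70, 80]
past_performance = [2.0, 2.6, 2.9, 3.5, 4.0]
slope, intercept = fit_line(past_scores, past_performance)

# Hypothetical applicants for two openings.
applicants = {"A": 55, "B": 72, "C": 64, "D": 48}
chosen = select_top(applicants, slope, intercept, openings=2)
print(f"predicted performance = {intercept:.2f} + {slope:.3f} * score")
print("selected:", chosen)
```

With a single predictor and a positive slope, this ranking is the same
as ranking on the raw score; the regression adds an estimate of how much
predicted performance separates one applicant from another.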

We have in the past tended to view the selection enterprise in terms
analogous to the economist's cost-benefit analysis. More efficient
employees can be considered to generate benefits to the employer in
improved productivity, at whatever cost is involved in a recruitment
and testing program. But in education we dare not take quite as narrow
a view of costs and benefits as might be acceptable for the professional
football coach or the industrial personnel manager. The benefits can-
not simply be represented by grade point average, but need to take
account of the broader utility of the person in the larger society. An
adequate medical student who will provide service in the urban ghetto
or the rural South may represent greater social utility than a brilliant
one who will compete for patronage in a middle-class suburb.

Decisions involve not only those facts that testing can supply, but
also a value system that has nothing to do with tests and testing. This
is the fundamental consideration that has been back of much of the
debate of the past decade about "fair testing," even though it has
seldom been identified as such. Competing definitions of what is "fair"
differ primarily not on psychometric issues but on the question of
whose utility is paramount. The classical approach that used tests and
any other available information to establish for each individual a pre-
dicted academic or job performance, and then selected those for whom
the prediction was highest, adopted a view narrowly focused on the
employer's or selector's utility. This narrow view may be acceptable
in the football coach whose values must focus solely on winning as
many games as possible. It becomes more questionable in an employer
whose decisions structure the job opportunities for large segments of
our society, and still more questionable in the admissions office of an
educational institution that exists only to serve society. Utility in a
graduate from a college or professional school must be viewed not
solely or primarily in terms of grade point average nor of income
X years out of college, but primarily in the broader sense of value to
society. This is, of course, a fuzzy, ambiguous notion, and there will be
wide differences in perception of where the common good lies. But
unless we can achieve consensus on such value questions, no amount
of psychometric elegance or refinement will bring us to agreement. It
is important, I believe, that we recognize that this is where the shoe
pinches. Perhaps we can develop a calculus of values that will permit
us to specify our utilities and to clarify our differences in the utility
that we attach to different outcomes, but for the present such a calculus
seems quite a remote prospect. And even clarification will not guar-
antee agreement.

However, one element in any judgment about utility is the probability
that the candidate will perform satisfactorily in the tasks to which he
seeks acceptance. How well will this candidate master the mysteries of
torts or the skills of operating a Selectric typewriter? And what model
of test will provide information that will be useful in indicating the per-
formance that we can expect from candidates X, Y and Z? I submit
that it is likely to be some general assessment of a broad area of
knowledge or skill. With what speed and understanding does Can-
didate X read social studies material, not what is his mastery of the
economic geography of Brazil? How well does the secretarial candidate
spell a broadly representative sample of words, not has he or she
mastery of the ei/ie rules (and their several exceptions)? A good old-line
survey test, with expectancy tables that indicate probable criterion per-
formance at each score level, will permit us to be more usefully informed
on the individual's prospects for effective performance than will a
narrowly focused criterion-referenced mastery test of some highly
specific skill.
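
As a minimal sketch of what an expectancy table contains, the following
groups candidates into score bands and reports, for each band, the
proportion who later met a criterion. The function name, the band width,
the criterion cutoff, and the data are all hypothetical and chosen only
for illustration; they do not come from any actual testing program.

```python
from collections import defaultdict


def expectancy_table(records, band_width=100, criterion=2.0):
    """Build a simple expectancy table.

    records: iterable of (test_score, criterion_value) pairs, e.g. a
    predictor score and a later grade point average.
    Returns {band_lower_bound: (proportion_meeting_criterion, n_in_band)}.
    """
    counts = defaultdict(lambda: [0, 0])  # band -> [n_success, n_total]
    for score, outcome in records:
        band = (score // band_width) * band_width
        counts[band][1] += 1
        if outcome >= criterion:
            counts[band][0] += 1
    return {band: (succ / total, total)
            for band, (succ, total) in sorted(counts.items())}


# Invented data: (predictor score, later grade point average).
sample = [
    (420, 1.8), (450, 2.1), (480, 1.9),
    (510, 2.4), (530, 1.7), (560, 2.8),
    (610, 3.0), (640, 2.2), (690, 3.4),
]

table = expectancy_table(sample)
for band, (proportion, n) in table.items():
    print(f"scores {band}-{band + 99}: {proportion:.0%} met criterion (n={n})")
```

A table of this kind answers the selection question directly: for a
candidate at a given score level, what share of earlier candidates at
that level went on to perform satisfactorily.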

The same thing is true, I believe, for a wide range of placement,
guidance and personal decisions. The type of information that could be
useful in deciding whether a freshman would be likely to learn more in
the remedial English section, the regular course, or a special course in
literature or in writing would be a broad appraisal of writing skills, of
competence in reading literary material, or conceivably of knowledge
of grammar and syntax rather than a focused mastery test of use of the
semicolon or of agreement between subject and verb. A personal
decision to apply to Harvard would be more soundly based on a broad
survey measure of high school achievement with performance com-
pared to norms for other high school juniors than on a mastery chem-
istry test on the periodic table.

Even decisions relating to curricular modifications or resource alloca-
tion would seem to call primarily for broad appraisals of the compre-
hensive set of objectives that the school system is trying to achieve. As
a matter of fact, there is little case being made for the narrowly
focused criterion-referenced mastery test as a basis for curricular or
resource allocation decisions. The current watchword here seems to be
"objective-referenced." This appears to mean that the school system
states in detail, with a good deal of specificity and usually at great
length, just what its instructional goals are, and that each test exercise is
designed to assess some one of those objectives. One can hardly
quarrel with a test design in which the test exercises are built to match
the content and process objectives that seem important as goals of
schooling. Every achievement test worth its salt has always been built
around a blueprint of curricular objectives. The question would seem
to be whether the objectives of schooling are sufficiently different from
one school system to another for it to be desirable to prepare a separate
and unique array of test exercises for each.

Undoubtedly there are instances in which objectives are local and
idiosyncratic. When a social studies program focuses on local history
or local economic geography, for example, New York, Illinois, and
California will need completely distinct evaluation instruments. When
particular state or local curricula operate with quite distinctive se-
quences for the presentation of topics, a common appraisal may make
sense only at the end when all have arrived at nearly the same final
destination. In some areas, at least, it will be unreasonable to expect
children to have learned what they haven't been taught.

However, development and use of special testing instruments for
local situations is not without its costs, and these costs lie not solely in
the hours and dollars required to develop and print the special tests.
Though it will be possible to determine what proportion of children in
a given school system succeed on a specific test item or group of test
items, this proportion will vary from low to high depending upon the
basic difficulty of the item, in addition, to some extent, to its relation-
ship to what has been taught and emphasized, and it will be difficult to
know whether one should be pleased or distressed by the percentages.
Unless test exercises are limited to the simplest exemplars of the
minimum essentials (in which case they will give a very incomplete
picture of the full range of learning sought and to a degree achieved
by the school), there will be varying proportions of children who will
not be able to do an item. If the item tests the limits of skill or knowl-
edge, the proportion who cannot master it may be quite high. Except-
ing as the items have been drawn from nationally standardized tests
for which item norms have been developed based on a representative
sample of school children, there will be no meaningful external basis
for comparison. It will be difficult, if not impossible, to determine
whether high and low percentages of right answers are to be attributed
to the successes and failures of the program or to the inherent ease or
difficulty of the test exercises. Furthermore, it will be impossible to
determine at what cost, in achievement of content and skills omitted
from the local assessment instruments, any gains in the objectives
assessed in those instruments have been achieved. It will, of course, be
possible to make internal comparisons between communities within a
state, between schools within a community, between classes and pupils
within a school. And these may be the comparisons that are relevant for
decisions concerning resource allocation, concerning local shifts of em-
phasis or local remedial effort. There is clearly likely to be some ad-
vantage in having these internal comparisons based on locally shared
and agreed-upon objectives. The issue is whether the gain is worth the
cost.

Some curricular and resource allocation decisions clearly call for
fine-grained information at the level of the item or the short subset of
items. Whether additional instruction on prime numbers is needed in the
5th grade can best be judged by knowing what percent of 5th graders
think that 15 or 27 or 91 are prime numbers. And whether one school
needs to provide special help on identifying the main idea of a para-
graph depends on whether students in that school show noticeably
lower proportions of success on exercises requiring that skill than do
students in other schools that recruit from a similar student population.
Norms at the item level, which have been rather generally provided by
test publishers during the past decade, provide valuable information
in terms of which to make such judgments and decisions. We can expect
that in the future publishers will continue to provide normative infor-
mation not only on test scores but upon items. But for broader assess-
ments of relative success on the major segments within a skill or
between skills, normative information on test scores will continue to
be needed.
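
The item-level comparison described above can be sketched as follows,
assuming a school has each pupil's right/wrong result per item and a
publisher's proportion correct for the same items. The item labels, the
norm values, the responses, and the 10-point flagging threshold are all
invented for illustration.

```python
def item_gaps(school_responses, norm_proportions):
    """Compare a school's proportion correct on each item with item norms.

    school_responses: {item_id: list of 0/1 results, one per pupil}
    norm_proportions: {item_id: proportion correct in the norm sample}
    Returns {item_id: local proportion minus norm proportion}.
    """
    gaps = {}
    for item, answers in school_responses.items():
        local_p = sum(answers) / len(answers)
        gaps[item] = round(local_p - norm_proportions[item], 2)
    return gaps


# Hypothetical results for ten pupils on two items.
school = {
    "main_idea_1": [1, 0, 0, 1, 0, 0, 0, 1, 0, 0],
    "prime_91": [1, 1, 0, 1, 1, 1, 0, 1, 1, 1],
}
# Hypothetical item norms from a comparable student population.
norms = {"main_idea_1": 0.62, "prime_91": 0.75}

gaps = item_gaps(school, norms)
# Flag items on which the school trails the norm by 10 points or more.
needs_attention = [item for item, gap in gaps.items() if gap <= -0.10]
print(gaps)
print(needs_attention)
```

Without the external norm values, the local proportions alone could not
show whether a low figure reflects the program or simply a hard item.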

Turning away from achievement tests, I would assert that micro-
analyses of successes and failures on specific test items make essentially
no sense on tests developed to measure aspects of aptitude, as con-
trasted with measures of achievements related to specific aspects of an
educational program. I am talking here about analyses used as a basis
for decisions about persons, and not about decisions on the develop-
ment and construction of tests. Obviously, item analysis plays a central
role in aptitude test construction. To know that on the Wechsler In-
telligence Scale for Children (WISC) a 14-year-old got a full-scale IQ
of 110 tells us something potentially meaningful about that child's
probable success in an algebra class. To know that the child got 14 of
the 18 items on the arithmetic test right may also be a useful datum.
But to know the one specific fact that he got the correct answer on "36
dollars at 4 dollars an hour" is of minimal help in our appraisal either
of his general scholastic aptitude, of his more specific quantitative
ability, or of his likelihood of being a successful algebra student. Apti-
tudes represent general areas of competence that have no precise
lateral boundaries and no upper limits. We appraise them by sampling
broadly from some extended and ill-defined domain, often relating per-
formance to that of others, since our inferences are predictive, and most
of our predictions are inherently relative rather than absolute. I find it
almost impossible to conceive how microanalysis of single aptitude test
items would contribute anything useful to decisions by or about persons.

Wisdom in relation to test scores calls for information on the predic-
tive significance of the test score in many specific contexts. There is a
big gap between knowledge in general of the validity of Scholastic
Aptitude Test (SAT) scores for predicting college success, even when
that success is narrowly defined as freshman grade point average, and
knowledge of the proportion of the applicants with SAT-V scores of 450
who are admitted to Siwash, and the distribution of GPAs of those per-
sons when they get there. The College Entrance Examination Board,
the American College Testing Program, and various of the state testing
service groups have steadily increased their efforts to make this type of
institution-specific information accessible beyond the smoke-filled
offices of admissions directors not only to school guidance staffs but to
the individual students who, in the final analysis, must make decisions
about their own futures.

A certain reluctance on the part of some institutions to make the
information available is understandable. There is an element of self-
fulfilling prophecy in letting information about one's institutional past
structure one's institutional future. But this is the type of information
that is most directly relevant to decisions about whether or where to
apply for admission. In all the settings in which test results are used for
guidance decisions or personal decisions, improved communication
systems are needed for assembling and transmitting specific informa-
tion on the implications of those test results for the alternative educa-
tional or vocational choices that are being faced.

This concern points also to a basic problem that we always face when
we try to base selection or counseling or personal decisions upon the
data that we have meticulously collected. Inevitably, these are data
from the past, sometimes from the fairly remote past. Yet we use them
for decisions that relate to the future, sometimes the fairly remote
future. For example, Project Talent's Career Data Book reports in 1973
the sorts of students tested in 1960 who were in various occupational cate-
gories five years after the year in which they were or would have been in
the twelfth grade. The counselor in 1976 who uses these data is helping
students to make decisions that will be operational in the 1980s. These
can be wise decisions only to the extent that occupational opportunities
and demands of 1980 match those of 1965 to 1969 when Talent's
students were making their occupational choices. The assumption may
be reasonable; the world changes fairly slowly. It is certainly nec-
essary. The only way we can anticipate the future is by knowing the
past. But it should also be recognized.

Just as it is necessary to use data from the past to make inferences
and decisions about the future, and to assume a continuity of the
standards, conditions and relationships of the past into the future, so it
is also necessary to project relationships from one specific setting to
another. It is manifestly impossible to replicate empirical validation
studies in every plant, office or school in which a testing procedure
might be used. Time, numbers, availability of sound performance esti-
mates, as well as financial resources, all set limits on what can be done.
So we must often use findings from other plants, offices or schools and
apply them to our present context.

Yet our skills of specifying the dimensions of similarity and difference
between jobs in different settings or intellectual demands in different
programs constitute a serious limitation on the confidence with which
we can generalize relationships of predictors to performance, and
standards of acceptable performance in different settings. There have
been calls, within the field of vocational psychology, for studies of the
microstructure of jobs, and of the relationships of test scores to elements
of that microstructure. I am not aware that we have made great strides
in that direction, and I am not sure what the payoff will be.

But if we are to generalize with any confidence from one academic
or job setting to another, it may well be that some more specific analyses
of just what it is in a job that is predicted by our test scores or other
items of information about a person, so far as that is concerned, will
be essential. In the interim, we can only maintain a discreet tentative-
ness in our generalization of data to new situations, trying as best we
can to assess the degree of identity between the setting of available
data and the setting to which we would apply them.

It is, alas, no easy matter to translate information to wisdom. Facts
are not simple but complex, and values are not uniform but diverse.
We may, to quote another aphorism, conclude with Pope Alexander,
not Paul, that "a little learning is a dangerous thing." We may con-
clude, as some groups and organizations appear to have done, that it is
better to forego information about the achievements and abilities of
our students individually and collectively because of the possibility
that we may use that information unwisely. We may abandon the
attempt to understand better and to teach others better to understand
the implications of test scores. We may elect to remain blissfully igno-
rant of the information that tests can give in the hopes that thus we can
[...] Or we may continue the struggle to understand, to appre-
ciate, and to be wise.